Preface


The  safe  operation  of  computer  systems  continues  to  be  a  key  issue 
in  many  applications  where  people,  environment,  investment,  or 
goodwill  can  be  at  risk.  Such  applications  include  medical,  railways, 
power  generation  and  distribution,  road  transportation,  aerospace, 
process  industries,  mining,  military  and  many  others. 

This  book  represents  the  proceedings  of  the  12th  International 
Conference  on  Computer  Safety,  Reliability  and  Security,  held  in 
Poznan,  Poland,  27-29  October  1993.  The  conference  reviews  the 
state  of  the  art,  experiences  and  new  trends  in  the  areas  of  computer 
safety,  reliability  and  security.  It  forms  a  platform  for  technology 
transfer  between  academia,  industry  and  research  institutions.  In 
an  expanding  world-wide  market  for  safe,  secure  and  reliable 
computer  systems  SAFECOMP93  provides  an  opportunity  for 
technical  developers,  users,  and  legislators  to  exchange  and  review 
the  experience,  to  consider  the  best  technologies  now  available  and 
to  identify  the  skills  and  technologies  required  for  the  future.  The 
papers were carefully selected by the International Program Committee
of the Conference. The authors of the papers come from 16
different countries. The subjects covered include formal methods
and models, safety assessment and analysis, verification and
validation, testing, reliability issues and dependable software technology,
computer languages for safety related systems, reactive
systems technology, security and safety related applications. With
its wide international coverage, its unique way of combining participants
from academia, research and industry, and its topical coverage,
SAFECOMP is outstanding among the related events in the
field.

The  reader  will  get  insight  into  the  basic  status  of  computer  safety, 
reliability  and  security  (through  invited  presentations)  and  will 
receive  a  representative  sample  of  recent  results  and  problems  in 
those  fields  presented  by  experts  from  both  industrial  and  academic 
institutions. 

The response to the Call for Papers produced many more good
papers than could be included in the programme. I must thank all
the authors who submitted their work, the presenters of the papers,
the International Program Committee and National Organising
Committee, the Sponsor and Co-sponsors for their efforts and
support. Through their strong motivation and hard work the
Conference and this book have been made possible.


Janusz  Gorski 


Poznan,  Poland 
August  1993 
INVITED PAPER


Safety  -  status  and  perspectives 


Tom Anderson

Department of Computing Science
The University of Newcastle upon Tyne, NE1 7RU, UK

Abstract 

Safety can be all things to all men - that is, different people in
different situations will, quite legitimately, interpret the term
“safety” in different ways. This paper expresses a personal
perspective on safety as an engineering concern.


1  Introduction 

Delegates will be aware that this is the 12th occasion of presenting SAFECOMP, a
conference which, under the auspices of EWICS Technical Committee 7, has laid
stress on the importance of safety in the context of computing systems since the
very first SAFECOMP in 1979. Consequently, the event has an enviable lineage
with respect to a topic that is recognised to be of rapidly increasing significance,
commensurate with the growth in automatic control of critical applications. It
seems inevitable that these trends will continue and accelerate, given current
projections for the semiconductor and telecommunication industries. Over the past
15 years, work on both research and system development has enhanced our
understanding of the issues and techniques relating to safety in computing systems.
However, much remains to be done, in further advancing the discipline and in more
widely promulgating the current state of the art. In this brief perspective I have
taken the opportunity to make some elementary observations on the tenets of safe
computing systems; if any of these are considered provocative or unsound I
welcome correction.

2  Definition 

Because “safe” and “safety” are words in everyday use, they have dictionary
definitions and popular interpretations. These interpretations can differ widely: for
the general public, for politicians, for professionals (lawyers, engineers, regulators
etc), across industrial sectors, and over time (especially after a major accident). A
scientist or engineer recognises the range of interpretations, but must nevertheless
adopt a specific working definition - and thus accepts the consequence that because
others may select an alternative definition, conflicts may need to be resolved if
confusion is to be avoided.

The usual starting point for a definition of safety is that a system is safe if it will
not kill anyone. However, numerous points then need clarification, such as “what
about multiple deaths?”, “what about injuries, severe and minor?”, “what about
environmental damage, with implications for human well-being?”, “what about vast
financial losses, with implications for the well-being of some?”. An (inadequate)
escape route is to assert that a system is safe if it will not harm anyone. But does
this mean never harm anyone, under any possible circumstances? Only when these,
and other, questions are answered would we have a semblance of a definition.
(There is, of course, no single “correct” definition, so these questions will not be
answered here!) One way forward is to define a system to be safe if it will not cause
an accident, thereby postponing (albeit briefly) the definition of what constitutes an
accident. Even given an agreed definition of a safe system, it is then vital to
examine how degrees of unsafeness should be characterised, which leads on to the
notion of risk to capture the likelihood and magnitude of losses incurred through
use of the system.

From an engineer’s viewpoint, ensuring that these issues are addressed and resolved
is much more important than the details of their resolution in a particular case.

3  Misconceptions 

Despite, or perhaps because of, the widespread use of safety concepts, a number of
misconceptions are frequently encountered in the wider computing community -
SAFECOMP delegates will, I trust, concur with my critique of the following
aberrant assertions.

(a) Safety is paramount. If this were true, then in almost all cases, the proper
course of action would be not to implement the system, or at least not to
operate it. Safety is an attribute of a system which frequently conflicts with
other desirable attributes. The design engineer has the difficult task of striving
to achieve the optimum compromise between safety and the other required
characteristics for the system, all within budgetary and other resource
constraints.

(b) Safety is an absolute. The notion of absolute safety can be formulated and
discussed if necessary, but the real engineering issues concern levels of safety
and tradeoffs between safety and other system properties. Consider the
following questions: How safe should the system be designed to be? How
unsafe could the system be and still be considered adequately safe? How safe is
the implemented system? How safe has the system been during operation? By
comparison the question “Is the system absolutely safe?” seems pointless.

(c) Safety can't be quantified (less extreme versions: safety ought not to be
quantified; avoid quantification in safety analyses). On the contrary, it is
essential that safety be quantified - to the extent that this is feasible, and fully
acknowledging the limitations and imprecision of measurement techniques.
Quantified analysis of safety should be viewed as the normal engineering goal,
and consequently the inability to quantify safety should be recognised as a
deficiency - in which case subjective rankings or objective comparisons may be
employed as a weaker alternative.

(d) Safety must be guaranteed. Since safety does not equate to death or taxes such a
guarantee must be regarded as a forlorn hope, other than in the sense of a
warranty establishing corporate liability.


(e) Safety is unique. Safety is a highly significant system attribute because of the
importance we rightly attach to the lives of others. Nevertheless, it has very
much in common with other system attributes such as reliability and security,
and safety engineering can and does benefit greatly from the techniques
developed for other aspects of dependability in systems - and vice-versa of
course. [A personal aside. At SAFECOMP’83 in Cambridge I asserted (as a
panellist) that the concepts of safety and reliability were essentially identical,
differing only in the criterion which specified success. Although I still believe
this to be true, I have learned a little in the last ten years, and do not expect to
reiterate this academic and potentially misleading observation in Poznan at
SAFECOMP’93.]

4  Axioms 

In  contrast  to  the  above,  the  following  truths  are  held  to  be  self-evident. 

(a) Safety is a system attribute. This is sometimes taken to imply that safety is
solely a property of the overall application system (e.g. nuclear power plant)
operating in the real-world environment; a very narrow interpretation then
misleads by inferring that subsystems do not have this property (contradicted
by axioms b and c below). A more generic use of the term system is much to
be preferred, encompassing subsystems, units, modules, components etc., in
which case axiom a is almost tautological.

(b)  Computing  systems  can  kill.  See  Leveson  and  Turner  [2]. 

(c)  Software  can  kill.  See  Leveson  and  Turner  [2].  Obviously,  the  software  directs 
the  computing  system  which  in  turn  acts  via  the  controlled  equipment  - 
analogously,  most  murderers  make  use  of  a  weapon. 

(d) Perfection is unattainable. Samuel Butler advised “Strive for imperfection -
there’s some chance of getting it”. Dijkstra warned “Testing can show the
presence, but never the absence of faults”. Lebesgue cautions “Logic makes us
reject certain arguments, but it cannot make us believe any argument”. Juvenal
asked “But who is to guard the guards themselves?”. Brooks summed it all up
- “There is inherently no silver bullet”.

(e) There's safety in numbers. Although this is a well known English phrase it is
perhaps a little too ambiguous to be axiomatic. A literal interpretation is
unusual and the benefits of quantification have already been suggested; here I
wish to take the standard usage, which suggests that members within a group
are less exposed to attack than isolated individuals, and thereby make the
standard argument in favour of redundancy. Any single entity can fail, and to
avoid a single point of failure alternative mechanisms should be available (e.g.
retry, or a spare, or diversity, or fail-safe).
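
As a rough numerical illustration of the redundancy argument in axiom (e), the sketch below computes the probability that every one of n redundant units fails on a given demand. It rests on a strong assumption that the text does not make: that the units fail independently. Common-mode faults violate this assumption, which is one reason diversity is listed among the alternatives. The function name and the failure probability used are hypothetical.

```python
def prob_all_fail(p_single: float, n: int) -> float:
    """Probability that all n redundant units fail on one demand.

    ASSUMPTION (not from the text): unit failures are statistically
    independent and each unit fails with the same probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if n < 1:
        raise ValueError("at least one unit is required")
    return p_single ** n


# Hypothetical figure: each unit fails on demand with probability 0.01.
for n in (1, 2, 3):
    print(n, prob_all_fail(0.01, n))
```

Under the independence assumption each added unit multiplies the failure probability by p_single; when failures are correlated the benefit is smaller than this figure suggests.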

5  Engineering  Safe  Computing  Systems 

The tasks of safety engineering are clearly manifold: to establish the safety
requirements for the system and its subsystems, to formulate safety policies,
specifications and strategies, to design for safety, to conduct hazard and safety
analyses, to compose the safety case and gain certification for the system, to
implement, install, operate and maintain the system in accordance with all of the
preceding. All are of vital importance (literally), which makes prioritisation rather
difficult. I would place particular emphasis on achieving safety, and feel that the
specific topics of requirements, validation and fault-tolerance deserve special
mention - but this may merely be a consequence of personal prejudice. In any case,
the above list of topics is driven by system life-cycle stages, and we should also
include: management, procedures, documentation, standards, human factors and
real-time considerations.

My  position  in  1989  was  stated  as: 

“I would commend three attributes to those involved in the construction of
[safe] computing systems. First, vigilance, in avoiding and eliminating
faults; second, diversity, to provide protection against the consequences of
faults; and third, simplicity, the hand-maiden of dependability” [1].

Almost  five  years  on,  the  only  change  I  wish  to  make  is  to  reverse  the  ordering. 
Lastly,  I  would  like  to  refer  readers  to  the  most  enjoyable  text  on  system  safety  I 
have  encountered  [3],  which  happens  to  be  in  the  domain  of  railway  safety  and  the 
lessons to be learnt from accidents; as well as being highly instructive, the book
provides  this  closing  quotation  to  emphasise  that  even  safety  engineers  can  learn 
from  their  mistakes: 

Out of this nettle, Danger,
We pluck this flower, Safety.

Henry IV (Part I)
Abstract

In this paper a methodology to develop safety-critical control
systems is proposed. These systems continuously interact with
the physical environment, and those admitting at least one failure
causing a catastrophe are classified as safety-critical. Our
methodology takes into account both the control system
(controller) and the physical environment (plant). After the
requirements analysis, the system is developed following a data
flow model, i.e., described as a static data flow network of nodes
executing concurrently and communicating asynchronously. The
plant is used as the test case for the validation of the controller
and their composition is analysed to show whether hazards are
reached. To this purpose we apply a transformation from data
flow networks to LOTOS specifications. The transformation
preserves the semantics of the original network and data flow
network properties can be derived and proved on the LOTOS
specification using available support tools. A train set example
for the contact-free moving of trains on a circular track divided
into sections is shown as an application of the methodology.


1  Introduction 

Control systems are computing systems which continuously interact with the
physical environment, e.g. traffic control or industrial process control systems.
Many control systems are safety-critical, i.e. systems for which at least one failure
exists that can cause a catastrophe. Therefore, in addition to their functional
capabilities, these systems require specified levels of dependability. In the framework
of safety-critical systems, one approach to improve the level of dependability is to
use formal specification and verification in conjunction with other methods of
software development such as testing and fault tolerance. The analysis of the critical
issues of a control system plays a vital role in the development of safety-critical
systems. Critical issues address what the system should not do and allow us to
concentrate on the elimination and control of the hazards. The study of the critical
issues of the system allows us to derive the constraints necessary to guarantee a
safe behaviour of the system (safety constraints) and the strategies to realise it
(safety strategies) [1]. The validation phase is as important as requirements analysis.
Validation is the activity that aims to check that the actual behaviour of the
developed system is as expected.

Data flow is a paradigm for concurrent computations. A data flow network is
composed of a set of nodes (or processes) all executing concurrently and
asynchronously. They communicate by exchanging messages, representing data
items, over asynchronous communication channels (following a FIFO policy). The
computation proceeds in a data driven manner: a node of the network is ready to
execute as soon as the required data tokens are available. Data flow is receiving great
attention, being known for its suitability for achieving a high degree of execution
parallelism and thus improved performance, but it has other useful
characteristics as well. A data flow network is usually very close to the intuitive
representation of a control system; that is, translating the conceived system
into a data flow graph is straightforward, as is inspecting the data flow graph to
determine which aspects of the system are represented [2], [3]. This makes data flow
generally recognised as a convenient programming paradigm for the development of
control systems. The referential transparency property admitted when nodes compute
functions, by which two executions of the same node with the same input data
produce equal output results, makes data flow "inherently fault tolerant": it is
possible to tolerate simple failures by re-evaluating the same function on the same
input data [4], [5]. Even if non-deterministic behaviour of nodes is allowed, the
strong isolation and information hiding still enforce good confinement, which is useful for
setting error confinement areas around modules by means of appropriate consistency
checks. The property of composability, which directly relates the general
behaviour of a system to that of its constituent parts [6], [7], helps verification and
validation. Lastly, structural models for software reliability assessment can be
applied since all data necessary to their use can be obtained by a simple
instrumentation of software code [8].

In this paper a systems development methodology is proposed. After the
requirements analysis, the system is developed following the computational model
based on Jonsson's formalism [7]. In the validation phase, the specification of
the physical environment is assumed as the test case for the control system: the
plant and the controller are composed and the resulting behaviour is analysed to be
sure that hazards are never reached in the system. To this purpose, we apply a
transformation from data flow networks to LOTOS (Language Of Temporal
Ordering Specification) [9] specifications. The transformation maintains the data
flow network properties, which can be derived and proved on the LOTOS
specification. Available LOTOS software support tools are then used [10]. The
adequacy of the proposed methodology is shown through the design and the
validation of a simple control system: a train set example for the contact-free
moving of trains on a circular track divided into sections [1], [11]. The rest of this
paper is organised as follows. Section 2 is devoted to the definition of our methodology,
including a description of the data flow formalism adopted, the transformation and
its properties. Section 3 develops the example of the train set to show how the
methodology can be applied. Lastly, Section 4 contains our conclusion.
2 System Development Methodology


The proposed development methodology takes into account the parallel interaction
between a plant and a controller which must eliminate unsatisfactory behaviours of
the plant. The interface between the plant and the controller contains sensors and
actuators. Sensors detect events in the plant and send signals to the controller. Upon
reception of the signals the controller can take actions by issuing appropriate control
commands through actuators. The analysis of the critical issues, addressing what the
system should not do, allows us to define the hazards for the system under consideration
and their elimination and control. The analysis is performed in two phases: the first
phase identifies the real world properties relevant to the critical behaviour of the
system, and the second phase specifies the system behaviour required at the interface
with the environment, i.e. the sensors and actuators. Thus the constraints necessary
to guarantee a safe behaviour of the system (safety constraints) and the strategies to
realise it (safety strategies) may be derived.
Then the system realising the safety strategy is developed following a data flow
computational model. Since we shall use the specification of the physical
environment as the test case for the control system in the validation, we shall
also model the plant. As previously mentioned, we adopt the formalism for the
specification of data flow networks proposed in [7], in which the semantics of the
networks is based on traces. Here we give some definitions and a brief explanation
of this model. Given a data flow network $N$, let $V$ be the set of data items
exchanged over the channels. We denote by $V^*$ the set of finite sequences on $V$ and
by $\varepsilon$ the empty sequence.

Definition: A data flow node $P$ is a tuple $\langle I_P, O_P, S_P, s^0_P, R_P, FAIR_P \rangle$
where:

$I_P$ is the set of input channels;

$O_P$ is the set of output channels with $I_P \cap O_P = \emptyset$;

$S_P$ is the set of states; $s^0_P$ is the initial state, $s^0_P \in S_P$;

$R_P$ is the set of firings. A firing $F$ is a tuple $F = \langle s, X_{in}, s', X_{out} \rangle$ where $s, s' \in S_P$,
$X_{in}$ is a mapping from $I_P$ to $V^*$ and $X_{out}$ is a mapping from $O_P$ to $V^*$;

$FAIR_P \subseteq \wp(R_P)$ is a finite collection of fairness sets. If $FAIR_P = R_P$, then the
node executes firings until no more data are present on the input channels. ♦


For the purposes of this paper, the meaning of a firing $\langle s, X_{in}, s', X_{out} \rangle$ can be
taken as follows: when the node is in state $s$ and, for each input channel $inp \in I_P$,
the sequence $X_{in}(inp)$ is a prefix of the content of the channel (i.e. the firing is
executable), then these sequences may be consumed, while the node changes its state
to $s'$ and the sequence $X_{out}(out)$ is produced on each output channel $out \in O_P$. Note
that the empty sequence $\varepsilon$ is a prefix of each sequence of data.
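
The informal meaning of a firing can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the formalism of [7]: channels are modelled as Python deques, the mappings from channels to sequences as dictionaries, and fairness sets are ignored. The names `Firing`, `is_executable`, `fire` and the example node are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Firing:
    """A firing <s, X_in, s', X_out> (hypothetical encoding)."""
    state: str        # s
    x_in: dict        # X_in: input channel name -> tuple of items to consume
    next_state: str   # s'
    x_out: dict       # X_out: output channel name -> tuple of items to produce


def is_executable(firing, state, channels):
    """True if the node is in state s and X_in(inp) is a prefix of each input channel."""
    if firing.state != state:
        return False
    for inp, seq in firing.x_in.items():
        content = list(channels[inp])
        if content[:len(seq)] != list(seq):
            return False
    return True


def fire(firing, state, channels):
    """Consume X_in from the inputs, produce X_out on the outputs, return s'."""
    if not is_executable(firing, state, channels):
        raise ValueError("firing is not executable in this configuration")
    for inp, seq in firing.x_in.items():
        for _ in seq:
            channels[inp].popleft()
    for out, seq in firing.x_out.items():
        channels[out].extend(seq)
    return firing.next_state


# Hypothetical one-state node that moves one item "x" from channel "a" to channel "b".
channels = {"a": deque(["x", "y"]), "b": deque()}
copy_x = Firing("idle", {"a": ("x",)}, "idle", {"b": ("x",)})
state = fire(copy_x, "idle", channels)
print(state, list(channels["a"]), list(channels["b"]))
```

A firing whose input mapping assigns the empty tuple to a channel is always executable with respect to that channel, which mirrors the remark that the empty sequence is a prefix of every sequence.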

A data flow network $N$ consists of a set $P_N$ of data flow nodes such that in $P_N$
each channel occurs at most once as an input channel and at most once as an output
channel. The network is obtained by connecting input channels to output channels with
the same name, and a network transition can be generated by the firing of a node or
by a communication event, where a communication event can be either an
input event or an output event. Communication events occur when a data item is
inserted (removed) into (from) an input (output) channel of the network. $C_N$ denotes
the set of all the channels of the network. A computation of the network is a
sequence of transitions of the network. Informally, a computation of the network is a
complete run of the network in which all nodes perform firings according to their
definition and all channels behave like unbounded FIFO channels. The semantics of
the network is the set of its traces: a trace represents the interleaving of the
communication events during a computation.
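
To illustrate what a trace records, the following sketch runs a tiny hypothetical pipeline of two nodes connected by FIFO channels and logs only the communication events at the external channels. Everything in it is an assumption made for illustration: the nodes ("double" and "inc"), the channel names, and the fixed scheduling. It produces one trace of one computation, whereas the semantics of a network is the set of all its traces.

```python
from collections import deque


def run_network(inputs):
    """Run a two-node pipeline (hypothetical example) and return one trace."""
    # "in" is an external input channel, "mid" is internal, "out" is an external output.
    channels = {"in": deque(), "mid": deque(), "out": deque()}
    trace = []
    for item in inputs:
        # Input event: the environment inserts a data item into an input channel.
        channels["in"].append(item)
        trace.append(("in", item))
        # Data driven execution: a node fires as soon as its input token is available.
        while channels["in"]:
            channels["mid"].append(2 * channels["in"].popleft())    # node "double"
        while channels["mid"]:
            channels["out"].append(channels["mid"].popleft() + 1)   # node "inc"
        # Output event: a data item is removed from an output channel.
        while channels["out"]:
            trace.append(("out", channels["out"].popleft()))
    return trace


print(run_network([1, 2]))
```

Node firings on the internal channel do not appear in the trace; only the interleaving of input and output events is observable.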

The use of information about the presence/absence of data items and the data driven
asynchronous execution of data flow nodes in data flow networks make reasoning
about these networks and their semantics very difficult. To perform the semantic
analysis of data flow networks, we apply a transformation from data flow networks
to process algebra specifications using the LOTOS formal specification language
[9]. LOTOS represents recent work on the combination of CCS (with some
extension) [12] to describe the behaviour of the system and an algebraic formalism
for the definition of data types. Software support tools have been developed allowing
the simulation, the compilation and the proof of properties of a LOTOS
specification [10].

The transformation is obtained by mapping each node and each channel of the
network into a process in the process algebra; all the processes are then
composed in parallel, with synchronisation on the proper set of actions, to realise the
global behaviour of the network [13]. The names of gates in the specification are
directly derived from the names of the channels. For each channel "a" $\in C_N$, "a#" is
the gate corresponding to getting a data item from the channel "a", while "a" is the gate
corresponding to putting a data item on the same channel "a". Let CP be the process which
simulates the behaviour of a channel "a" of $N$ (CP behaves like a FIFO buffer) and
nodeP be the process that realises the behaviour of the data flow node $P$; the
specification of the network is:

    specification netN[EgatesN] : noexit
      <data type definition>
    behaviour
      hide |[CgatesN-EgatesN]| in
        (CP[a, a#] ||| ...<∀c∈CN>... ||| CP[b, b#])
        |[CgatesN-EgatesN]|
        (nodeP[IP#, OP] ||| ...<∀Q∈PN>... ||| nodeQ[IQ#, OQ])
    endspec (* netN *)

where CgatesN are the gates corresponding to get (put) from (onto) the whole set of
channels of N, and EgatesN are the gates corresponding to get (put) from (onto) the
input (output) external channels of N. Furthermore, the notation IP# (OP) is used to
denote the set of "a#" ("a") gates for the input (output) channels of the node P. The
processes associated to channels execute disjoint actions, so they are put in
parallel with an empty set of synchronisation gates (||| operator). The same applies
to the set of processes associated to the nodes. These two sets of processes
synchronise on the set of all the actions defined for the two behaviour expressions.
The network specification has the same behaviour as the original data flow network
and the formal verification methods of process algebras can be applied to prove
properties of the original network. Interested readers may find more details on the
transformation itself, and a proof that the transformation preserves the data flow
network properties, i.e., that the LOTOS specification has the same behaviour as the
network from which it has been derived, in [13]. The previous transformation is
defined for a class of data flow networks in which the firings of the nodes do not
require sophisticated synchronisation mechanisms between the processes associated
to the channels and the processes which simulate the behaviour of the nodes. The
transformation for general networks is described in [14].

To  summarise,  our  methodology  is  based  on: 

•  modelling the physical environment as a part of the overall system (plant);

•  executing the requirements analysis for both the mission and the critical
issues of the system;

•  specifying safety constraints and a safety strategy for the system to eliminate
hazards;

•  developing the control system in the data flow computational model;

•  applying the transformation to the data flow specification of the system (both
the control system and the plant), obtaining a LOTOS specification which
maintains all the relevant properties (and doing some expression
transformation if necessary for their automatic analysis);

•  verifying the correct behaviour of the system composed of the plant and the
controller through an automatic analysis of the resulting LOTOS expression
using the available tools.

3  The  Train  Set  Example 

The train set example consists of a simple control system for the contact-free
moving of trains on a circular track [1], [11]. Suppose that two trains move in one
direction on a circular track divided into six sections, with the constraint
that trains are less than one section in length. Hazardous states are the states in
which a train may be involved in a collision. In our system, a state is hazardous if
the front of one train is in the same or an adjacent section as the front of another train.
Hazardous states are avoided if the following condition (safety condition) always
holds: the heads of the trains differ by at least 2 sections. The concept of a reserved
section is introduced and our safety strategy is based on: 1) a section can be reserved
by only one train; 2) for any train, the section of the front of the train and the section
behind must be reserved; 3) a train must always reserve a section before entering it.
We use $\ominus$ and $\oplus$ to represent the operation of subtraction modulo 6 and the
operation of addition modulo 6, respectively.
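
The safety condition and the three rules of the safety strategy can be tried out in a short simulation. The sketch below is an illustration under assumptions that are not part of the paper: trains are chosen at random to attempt a move, reserving the next section, entering it and freeing the section left behind are treated as one atomic step, and all function names are hypothetical. It is not the data flow controller itself.

```python
import random

N_SECTIONS = 6


def add_mod(i, k=1):
    """i ⊕ k: addition modulo 6."""
    return (i + k) % N_SECTIONS


def sub_mod(i, k=1):
    """i ⊖ k: subtraction modulo 6."""
    return (i - k) % N_SECTIONS


def is_safe(front_a, front_b):
    """Safety condition: the heads of the trains differ by at least 2 sections."""
    gap = min(sub_mod(front_a, front_b), sub_mod(front_b, front_a))
    return gap >= 2


def simulate(fronts, steps, seed=0):
    """Move trains under the reservation strategy; return the visited states."""
    rng = random.Random(seed)
    fronts = list(fronts)
    owner = {}  # section -> index of the train that has reserved it
    for train, front in enumerate(fronts):
        # Rule 2: the section of the front and the section behind are reserved.
        for section in (front, sub_mod(front)):
            if section in owner:
                raise ValueError("initial state violates rule 1")
            owner[section] = train
    history = [tuple(fronts)]
    for _ in range(steps):
        train = rng.randrange(len(fronts))
        nxt = add_mod(fronts[train])
        if owner.get(nxt, train) != train:
            continue  # Rule 1: the section is reserved by another train, so wait.
        owner[nxt] = train                 # Rule 3: reserve before entering.
        del owner[sub_mod(fronts[train])]  # The old section behind becomes free.
        fronts[train] = nxt
        history.append(tuple(fronts))
    return history


history = simulate([0, 3], steps=200)
print(all(is_safe(a, b) for a, b in history))
```

Because each train always holds its front section and the one behind, and no section has two owners, the fronts of two trains can never be in the same or adjacent sections, so the check prints True for any seed.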

We divide the system under development into the physical plant and the controller,
which communicate by sending control signals, and then we apply the data flow
model based on Jonsson's formalism [7]. The plant is composed of six sections
($Sect_0$, ..., $Sect_5$) shown in Figure 1 (a). In each section a sensor detects a train
entering the section and an actuator has the task of stopping a train before it leaves the
section when necessary. We model the flow of the train by messages sent by one
section to the next (channel $sn_i$). On receipt of this message the section sends a
signal to the controller to notify the passage of a train (channel $es_i$). Then, before the
train is allowed to move on, the section waits for a message from the controller
meaning that the train is allowed to leave the section (channel $go_i$). On
receipt of this message, the section sends a message to the next section in the
circular track to simulate the movement of the train (channel $sn_{i \oplus 1}$).


Figure 1: The Data Flow Network of the Train Set System: (a) plant; (b) controller.


The controller interacts with the plant and is composed of the data flow nodes
reported in Figure 1 (b): six $CNT_i$ nodes and six $RES_i$ nodes. Each $CNT_i$ realises
the communication with the section $Sect_i$ of the plant, while each $RES_i$ implements
the reservation mechanism of the corresponding section $Sect_i$. The $CNT_i$ node,
after having received a signal from section $Sect_i$ that a train has arrived
(channel $es_i$), sends a signal to $RES_{i \ominus 2}$ to mark section
$Sect_{i \ominus 2}$ as free (channel $V_{i \ominus 2}$), and then tries to book
section $i \oplus 1$ for the train by sending a signal to $RES_{i \oplus 1}$ (channel
$P_{i \oplus 1}$). $CNT_i$ waits for a positive answer from $RES_{i \oplus 1}$ (the next
section has been reserved) (channel $ok_{i \oplus 1}$), and then sends a signal to
$Sect_i$ allowing the train to leave section $Sect_i$ (channel $go_i$). Each $RES_i$
node controls the status of the corresponding section, which can be either reserved
for one train or free. It receives signals from $CNT_{i \ominus 1}$ (channel $P_i$) and
reserves the section by sending an acknowledgement (channel $ok_i$). After the section
has been reserved, it accepts only a signal through the channel $V_i$, which frees the
section, before accepting (and making) any further reservation.
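
The reservation mechanism described above can be explored with a small state-space search. The sketch below is an abstraction written for illustration only: it ignores the channels and buffers of the data flow network, treats "book the next section" and "move" as atomic steps, and all names (`reserved`, `successors`, `explore`, `heads_safe`) are hypothetical. It is not the LOTOS analysis performed in the paper.

```python
from collections import deque

N = 6  # number of track sections


def reserved(head, booked):
    """Sections held by a train: front, the one behind, plus the next one once booked."""
    held = {head, (head - 1) % N}
    if booked:
        held.add((head + 1) % N)
    return held


def successors(state):
    """Yield the states reachable in one step from ((headA, bookedA), (headB, bookedB))."""
    for k in (0, 1):
        head, booked = state[k]
        other = reserved(*state[1 - k])
        if booked:
            # 'go' signal: the train advances; on arrival the section two behind is freed
            nxt = ((head + 1) % N, False)
        elif (head + 1) % N not in other:
            # request P answered by ok: the next section becomes reserved
            nxt = (head, True)
        else:
            continue  # the request stays pending; this train cannot step now
        yield tuple(nxt if j == k else state[j] for j in (0, 1))


def explore(initial):
    """Breadth-first search of every state reachable from the initial one."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def heads_safe(state):
    """Safety condition: the two heads differ by at least 2 sections."""
    d = (state[0][0] - state[1][0]) % N
    return min(d, N - d) >= 2


INITIAL = ((1, False), (5, False))  # train A in section 1, train B in section 5
reachable = explore(INITIAL)
```

Under these simplifying assumptions, every state in `reachable` satisfies `heads_safe`, and every state has at least one successor, which mirrors (but does not replace) the deadlock-freedom and safety results obtained with the tools later in the section.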

The resulting data flow network N, composed of the controller and the plant, is
shown in Figure 1. Let A and B be natural numbers representing the identifiers of
the trains running over the track. We suppose an initial state with train A in section
$Sect_1$ and train B in section $Sect_5$; it follows that sections 1 and 0 (for train A)
and sections 5 and 4 (for train B) must be reserved. The initialisation is used to
define the initial state of the data flow nodes. For those communications which are
signals we associate the dummy value 1 in defining the firings of the nodes
(another way is to allow any data value). The definition of the data flow nodes is:

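Section node $Sect_i$

$I_{Sect_i} = \{sn_i, go_i\}$, $O_{Sect_i} = \{es_i, sn_{i \oplus 1}\}$, $S_{Sect_i} = \{s, s_a, s_b\}$

$R_{Sect_i} = \{F1, F2, F3, F4\}$, $FAIR_{Sect_i} = R_{Sect_i}$

$F1 = \langle s, [sn_i \to A], s_a, [es_i \to 1] \rangle$, $F3 = \langle s_a, [go_i \to 1], s, [sn_{i \oplus 1} \to A] \rangle$

$F2 = \langle s, [sn_i \to B], s_b, [es_i \to 1] \rangle$, $F4 = \langle s_b, [go_i \to 1], s, [sn_{i \oplus 1} \to B] \rangle$

$s^0_{Sect_1} = s_a$, $s^0_{Sect_5} = s_b$, $s^0_{Sect_i} = s$ for $i \in \{0, 2, 3, 4\}$.

Controller node $CNT_i$

$I_{CNT_i} = \{es_i, ok_{i \oplus 1}\}$, $O_{CNT_i} = \{go_i, P_{i \oplus 1}, V_{i \ominus 2}\}$, $S_{CNT_i} = \{s, s'\}$

$R_{CNT_i} = \{F5, F6\}$, $FAIR_{CNT_i} = R_{CNT_i}$

$F5 = \langle s, [es_i \to 1], s', [V_{i \ominus 2} \to 1, P_{i \oplus 1} \to 1] \rangle$, $F6 = \langle s', [ok_{i \oplus 1} \to 1], s, [go_i \to 1] \rangle$

$s^0_{CNT_i} = s'$ for $i \in \{1, 5\}$ and $s^0_{CNT_i} = s$ for $i \in \{0, 2, 3, 4\}$.

Controller node $RES_i$

$I_{RES_i} = \{P_i, V_i\}$, $O_{RES_i} = \{ok_i\}$, $S_{RES_i} = \{s, s'\}$

$R_{RES_i} = \{F7, F8\}$, $FAIR_{RES_i} = R_{RES_i}$

$F7 = \langle s, [P_i \to 1], s', [ok_i \to 1] \rangle$, $F8 = \langle s', [V_i \to 1], s, [\,] \rangle$

$s^0_{RES_i} = s'$ for $i \in \{0, 1, 4, 5\}$ and $s^0_{RES_i} = s$ for $i \in \{2, 3\}$.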

To apply the transformation we specify the maximum size of the channels, which
may be assumed equal to two, while the signal communications can be transformed
into pure synchronisation actions in LOTOS. We give here the LOTOS process
definitions of the single data flow nodes, obtained by applying the transformation
described in Section 2. The process definitions for the Sect, CNT and RES nodes,
and that for the CP process which simulates a FIFO buffer of length two, are:

process nodeSect[sn, es, go, nextsn](actstate: state) : noexit :=
  ([actstate = s] -> (sn?X:nat [X = A]; i; es!1; nodeSect[sn, es, go, nextsn](sa)
                      [] sn?X:nat [X = B]; i; es!1; nodeSect[sn, es, go, nextsn](sb))
   [] [actstate = sa] -> go?X:nat; i; nextsn!A; nodeSect[sn, es, go, nextsn](s)
   [] [actstate = sb] -> go?X:nat; i; nextsn!B; nodeSect[sn, es, go, nextsn](s))
endproc (* nodeSect *)

process nodeCNT[es, P, ok, V, go](actstate: state) : noexit :=
  ([actstate = s] -> es?X:nat; i; V!1; P!1; nodeCNT[es, P, ok, V, go](s')
   [] [actstate = s'] -> ok?X:nat; i; go!1; nodeCNT[es, P, ok, V, go](s))
endproc (* nodeCNT *)

process nodeRES[P, ok, V](actstate: state) : noexit :=
  ([actstate = s] -> P?X:nat; i; ok!1; nodeRES[P, ok, V](s')
   [] [actstate = s'] -> V?X:nat; i; nodeRES[P, ok, V](s))
endproc (* nodeRES *)

process CP[inp, out] : noexit :=
  hide mid in oneslot[inp, mid] |[mid]| oneslot[mid, out]
where
  process oneslot[a, b] : noexit := a?X:nat; b!X; oneslot[a, b]
  endproc (* oneslot *)
endproc (* CP *)
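
The CP process builds a two-place FIFO buffer out of two one-place cells that synchronise on a hidden gate. A rough Python analogue is sketched below for readers less familiar with LOTOS. It is an illustration under stated assumptions: the class names `OneSlot` and `CP` are hypothetical, and the hidden transfer on `mid` is performed eagerly, whereas in LOTOS it is an internal action that may interleave freely with other actions.

```python
class OneSlot:
    """A single-place cell: accepts one value, then must emit it before accepting another."""

    def __init__(self):
        self.value = None
        self.full = False

    def can_put(self):
        return not self.full

    def put(self, x):
        assert not self.full, "cell already holds a value"
        self.value, self.full = x, True

    def get(self):
        assert self.full, "cell is empty"
        x, self.value, self.full = self.value, None, False
        return x


class CP:
    """Two cells in series; the hidden 'mid' transfer is performed eagerly."""

    def __init__(self):
        self.first, self.second = OneSlot(), OneSlot()

    def _shift(self):
        # internal (hidden) synchronisation on 'mid'
        if self.first.full and self.second.can_put():
            self.second.put(self.first.get())

    def can_put(self):
        self._shift()
        return self.first.can_put()

    def put(self, x):
        self._shift()
        self.first.put(x)
        self._shift()

    def can_get(self):
        self._shift()
        return self.second.full

    def get(self):
        self._shift()
        x = self.second.get()
        self._shift()
        return x
```

With this sketch, two values can be stored before the buffer refuses further input, and values leave in the order in which they entered.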

Since LOTOS specifications belonging to the subset of LOTOS without data (basic
LOTOS) can be completely analysed by the verification tools, while specifications
with data values can only be simulated and/or compiled and run, we restrict
ourselves to basic LOTOS whenever this is possible without losing properties. The
LOTOS behaviour analyser AUTO [1^] allows us to build the automaton of a basic
LOTOS specification and to prove strong and weak bisimulation between
specifications. Although it fails when running on large specifications, simple ones
like ours can be run successfully, and the LOGIC CHECKER tool [16] can be used to
prove action-based logic (ACTL) formulas over the specification. To this purpose we
make some manipulations of the specification obtained directly by the data flow to
LOTOS transformation, trying to synchronise processes and to hide actions as soon
as possible. This allows AUTO to reduce the number of states during the generation
of the automaton of the specification. The LOTOS "Regrouping Parallel Processes"
correctness-preserving transformation can be applied automatically by the LOTOS
structure editor to regroup processes differently. The transformation preserves
strong bisimulation equivalence. All the previous tools are included in the LOTOS
integrated tool environment Lite [10], developed inside the LOTOSPHERE ESPRIT
project. Since all the nodeSect, nodeCNT and nodeRES processes execute all the
actions in state s and then the actions in state s' (nodeSect executes actions in
either sa or sb) before repeating, we assume s as the initial state and rewrite the
processes as:

process nodeSect[sn, es, go, nextsn] : noexit :=
  (sn?X:nat [X = A]; i; es!1; go?X:nat; i; nextsn!A; nodeSect[sn, es, go, nextsn]
   [] sn?X:nat [X = B]; i; es!1; go?X:nat; i; nextsn!B; nodeSect[sn, es, go, nextsn])
endproc (* nodeSect *)

process nodeCNT[es, P, ok, V, go] : noexit :=
  es?X:nat; i; V!1; P!1; ok?X:nat; i; go!1; nodeCNT[es, P, ok, V, go]
endproc (* nodeCNT *)

process nodeRES[P, ok, V] : noexit :=
  P?X:nat; i; ok!1; V?X:nat; i; nodeRES[P, ok, V]
endproc (* nodeRES *)

To take into account the initial position of the trains, the corresponding processes
must contain a prefix behaviour expression representing the action to be performed at
system start. This leads to the definition of the following processes: nodeSectA and
nodeSectB for the sections where train A and train B are at the beginning,
respectively; nodeICNT for the controllers that have to reserve the next section to
allow the trains to move (1 and 5 in our case); and nodeIRES for the sections that
are reserved at the beginning (0, 1, 4 and 5 in our case). We have:

process nodeSectA[sn, es, go, nextsn] : noexit :=
  go?X:nat; i; nextsn!A; nodeSect[sn, es, go, nextsn]
endproc (* nodeSectA *)

process nodeSectB[sn, es, go, nextsn] : noexit :=
  go?X:nat; i; nextsn!B; nodeSect[sn, es, go, nextsn]
endproc (* nodeSectB *)

process nodeICNT[es, P, ok, V, go] : noexit :=
  P!1; ok?X:nat; i; go!1; nodeCNT[es, P, ok, V, go]
endproc (* nodeICNT *)

process nodeIRES[P, ok, V] : noexit :=
  V?X:nat; i; nodeRES[P, ok, V]
endproc (* nodeIRES *)

We can now map our specification into basic LOTOS. Lite provides many
mappings from a full LOTOS specification onto a basic LOTOS one. They differ
in the data value information that is removed. We can apply the simplest
transformation, named "trans.npO", where all data are dropped, keeping simply the
original gate identifiers as basic LOTOS actions. The transformation can be directly
invoked by the behaviour analysis menu entry. In order to apply this mapping
without losing information, we modify the specification by defining one gate for
train A and another one for train B when they run over the track (i.e. substituting
each action $sn_i$ with two actions $asn_i$ and $bsn_i$). The new process nodeSect is
simply a non-deterministic choice between the actions corresponding to the passage of
the two trains. This is the only communication channel where data are important; in
all the others the values of the data are not significant and can be dropped. The basic
LOTOS specification of the section is:

process nodeSect[asn, bsn, es, go, nextasn, nextbsn] : noexit :=
  (asn; i; es; go; i; nextasn; nodeSect[asn, bsn, es, go, nextasn, nextbsn]
   [] bsn; i; es; go; i; nextbsn; nodeSect[asn, bsn, es, go, nextasn, nextbsn])
endproc (* nodeSect *)


The behaviour expression of the whole specification of the system is reported in the
Appendix, where the observable actions are the actions corresponding to the
movement of the trains over the track (gates $asn_i\#$ and $bsn_i\#$). Note that there
are no external channels of the network, and the set of processes associated with the
nodes must synchronise with the set of channel processes on the whole set of gates.
The LOTOS behavioural analyser AUTO can be run over the specification, allowing us
to easily prove our safety strategy. The automaton (considering weak bisimulation
equivalence) has 18 states and 24 transitions, and it is deadlock free. We proved
automatically, by using the LOGIC CHECKER over the automaton, the following
logic formulas to be true for train A:

1) train A can enter any section: $A[true\{true\}\,U\,\{asn_i\#\}\,true]$;

2) train A can only move from section $i$ to section $i \oplus 1$:
$AG([asn_i\#]\,A[true\{cond\}\,U\,\{\sim asn_{i \oplus 1}\#\}\,true])$;

where $cond = (\sim asn_0\#) \,\&\, (\sim asn_1\#) \,\&\, (\sim asn_2\#) \,\&\, (\sim asn_3\#) \,\&\, (\sim asn_4\#) \,\&\, (\sim asn_5\#)$;

3) for each path such that train A enters section $i$, train B cannot enter section
$(i \oplus 1)$ until train A enters section $(i \oplus 1)$:

AG([asni#]A[tnie{~bsn(iei)#)U{bsn(iei)#}AIlrue{~bsn(i0i)#)U(asn(i©i)#)truel]). 
The same formulas can be proved to be true for train B.

From these we have that when train A is in section $i$, train B is never in section
$i \ominus 1$, $i$, or $i \oplus 1$. This also holds for train B, thus satisfying the
safety condition.

4  Conclusions 

In this paper we have presented a methodology which can be used for the design of
safety-critical systems and for the validation of the design. Quite apart from the
modelling of the physical environment as a part of the overall system, which can be
used as a test case for the control system, the use of the data flow computational
model for the description of the system specification allows the designer to use
notations which are very natural and which can be made even more user friendly by
the use of development tools such as a graphical editor [4]. The transformation into
a process algebra specification allows the use of the analysis tools available for
LOTOS, making the entire process from specification to verification and validation
fully automated.

The proposed approach has been applied to a simple control system, where advantage
could be taken of the basic LOTOS tools such as the behavioural analyser AUTO, for
the generation of the automaton, and the LOGIC CHECKER. The extension of the
proposed approach to the validation of LOTOS specifications of control systems with
data values involves the use of the simulator tool [10] and the compiler available in
the full LOTOS environment, which allow the possible traces of execution of the
original data flow network to be derived. This extension is in any case limited by the
fact that tracing the behaviour of a general network may be very lengthy, and
unfeasible in the case of infinite input sequences. Nevertheless, for control systems
where the possible input sequences are constrained either in data value or in
periodicity, the proposed approach can be used for problems of larger size than that
presented in this paper.

Appendix

specification SYSTEM [asn0#, asn1#, asn2#, asn3#, asn4#, asn5#,
                      bsn0#, bsn1#, bsn2#, bsn3#, bsn4#, bsn5#] : noexit

behaviour

hide
  asn0, asn1, asn2, asn3, asn4, asn5, bsn0, bsn1, bsn2, bsn3, bsn4, bsn5,
  es0, es1, es2, es3, es4, es5, es0#, es1#, es2#, es3#, es4#, es5#,
  go0, go1, go2, go3, go4, go5, go0#, go1#, go2#, go3#, go4#, go5#,
  P0, P1, P2, P3, P4, P5, P0#, P1#, P2#, P3#, P4#, P5#,
  ok0, ok1, ok2, ok3, ok4, ok5, ok0#, ok1#, ok2#, ok3#, ok4#, ok5#,
  V0, V1, V2, V3, V4, V5, V0#, V1#, V2#, V3#, V4#, V5#
in

(nodeSect[asn0#,bsn0#,es0,go0#,asn1,bsn1] ||| nodeCNT[es0#,P0,ok0#,V0,go0] |||
 nodeSectA[asn1#,bsn1#,es1,go1#,asn2,bsn2] ||| nodeICNT[es1#,P1,ok1#,V1,go1] |||
 nodeSect[asn2#,bsn2#,es2,go2#,asn3,bsn3] ||| nodeCNT[es2#,P2,ok2#,V2,go2] |||
 nodeSect[asn3#,bsn3#,es3,go3#,asn4,bsn4] ||| nodeCNT[es3#,P3,ok3#,V3,go3] |||
 nodeSect[asn4#,bsn4#,es4,go4#,asn5,bsn5] ||| nodeCNT[es4#,P4,ok4#,V4,go4] |||
 nodeSectB[asn5#,bsn5#,es5,go5#,asn0,bsn0] ||| nodeICNT[es5#,P5,ok5#,V5,go5] |||
 nodeIRES[P0#,ok0,V0#] ||| nodeIRES[P1#,ok1,V1#] ||| nodeRES[P2#,ok2,V2#] |||
 nodeRES[P3#,ok3,V3#] ||| nodeIRES[P4#,ok4,V4#] ||| nodeIRES[P5#,ok5,V5#])

|| (* full synchronisation *)

(CP[asn0,asn0#] ||| CP[asn1,asn1#] ||| CP[asn2,asn2#] ||| CP[asn3,asn3#] |||
 CP[asn4,asn4#] ||| CP[asn5,asn5#] ||| CP[bsn0,bsn0#] ||| CP[bsn1,bsn1#] |||
 CP[bsn2,bsn2#] ||| CP[bsn3,bsn3#] ||| CP[bsn4,bsn4#] ||| CP[bsn5,bsn5#] |||
 CP[es0,es0#] ||| CP[es1,es1#] ||| CP[es2,es2#] ||| CP[es3,es3#] ||| CP[es4,es4#] |||
 CP[es5,es5#] ||| CP[go0,go0#] ||| CP[go1,go1#] ||| CP[go2,go2#] ||| CP[go3,go3#] |||
 CP[go4,go4#] ||| CP[go5,go5#] ||| CP[P0,P0#] ||| CP[P1,P1#] ||| CP[P2,P2#] |||
 CP[P3,P3#] ||| CP[P4,P4#] ||| CP[P5,P5#] ||| CP[ok0,ok0#] ||| CP[ok1,ok1#] |||
 CP[ok2,ok2#] ||| CP[ok3,ok3#] ||| CP[ok4,ok4#] ||| CP[ok5,ok5#] ||| CP[V0,V0#] |||
 CP[V1,V1#] ||| CP[V2,V2#] ||| CP[V3,V3#] ||| CP[V4,V4#] ||| CP[V5,V5#])


where

  <process definitions>

endspec (* SYSTEM *)

Abstract. In verifying a safety-critical system, one usually begins by
building a model of the basic system and of its safety mechanisms. If the
basic system model does not reflect reality, the verification results are
misleading. We show how a model of a system can be compared with the
system's fault trees to help validate the failure behaviour of the model.
To do this, the meaning of fault trees is formalised in temporal logic
and a consistency relation between models and fault trees is defined. An
important practical feature of the technique is that it allows models and
fault trees to be compared even if some events in the fault tree are not
found in the system model.


1  Introduction 

Safety-critical systems often have mechanisms designed to prevent, detect, or
tolerate system faults. To ensure that these mechanisms work as intended,
a model of the system can be built from two parts: a model of the basic system
and a model of the safety mechanisms (see Figure 1). Important properties of
the system are then verified of the model: for example, that if a component
failure occurs, then it is detected.


Fig.  1.  A  Model  of  a  Safety-Critical  System


For the verification results to be valid, the basic part of the model should
reflect the true connection between component failures and system faults in
the system. We are aware of a study of a rail interlocking system in which the
preliminary system model allowed only one train per track section, thus making
collisions impossible. Less obvious problems may be harder to discover, such as
when a particular combination of failures leads to a system fault in the real
system but not in the system model.

We propose a validation technique in which a system model is compared to
its fault trees. If a system model and its fault trees are not consistent in a sense
that we will define, then the system model may not be valid. Fault trees are
well suited for this purpose because they are specifically intended to capture the
relationship between component failures and system faults.

The  two  main  sections  of  the  paper  cover  the  precise  meaning  of  fault  trees 
and  our  proposed  relationship  between  fault  trees  and  system  models.  First, 
however,  we  present  an  example. 

2  Example 

To  make  discussion  of  the  problem  more  concrete,  we  present  a  simple  boiler 
system  example  (see  Figure  2). 


Fig.  2.  A  Simple  Boiler  System 


Steam is produced by water contained in the boiler vessel. The water level in
the vessel is read by two sensors, which pass their readings to a control system.
If the readings are below a certain value, the pump is turned on, delivering water
to the vessel. If the level readings are above a certain value, the pump is turned
off.

One safety-critical fault of the system is a boiler level that is too high. A
fault tree for this fault is given in Figure 3.

A fault tree represents how events in a system can lead to a particular system
fault. The event symbols used here are either basic events (which are drawn
as circles and represent component failures) or intermediate events (which are
drawn as rectangles and represent events which occur because of lower-level
events). The system fault is shown as the event at the root of the tree. Event
symbols are connected in the tree by gate symbols, which are either and-gates
or or-gates.

Fig. 3. A Fault Tree for the Boiler System

The full fault tree notation has many more event and gate symbols, but if
we do not consider the probabilistic meaning of fault trees then the symbols we
have described are enough.

3 Fault Tree Semantics

If we are to compare fault trees and system models, we need to understand
precisely what a fault tree means. Unfortunately, even the most definitive sources
(e.g., the Fault Tree Handbook [5]) are vague on some critical points.

One  issue  is  the  nature  of  events.  Are  they  to  be  regarded  as  conditions 
having  duration  or  as  instantaneous  occurrences?  The  example  event  “contacts 
fail  to  open”  from  the  Fault  Tree  Handbook  suggests  the  former,  but  the  example 
“timer  reset”  suggests  the  latter. 

The second issue is the gate condition: does "and" mean that both input
events happen at once, or only that one happens and then the other?

A third issue is the nature of causality. A gate models a sufficient cause if
the output must occur if the gate condition is satisfied by the inputs. A gate
models a necessary cause if the gate condition must be satisfied by the inputs
if the output occurs. According to the Fault Tree Handbook, fault trees model
sufficient and necessary causes. However, Figure IX-10 of the Handbook shows
an event labelled "wire faults in K3 relay & comp. circuitry" as a cause of "K3
relay contacts fail to close", but one can imagine circumstances in which wire
faults occur in such a way that the relay contacts do not fail to close. Therefore
the cause as stated is not a sufficient one.

Causes of an event are also supposed to be immediate. This term seems
related to the notion of flow, and may not be relevant in systems that cannot be
captured easily with flow models. All examples in the Fault Tree Handbook are
illustrated with flow diagrams. Immediacy also suggests time. For our purposes,
a gate models an immediate cause if no time passes between a cause and its
effect.

We  now  present  a  formal  semantics  for  fault  trees.  Events  are  treated  as 
conditions  having  duration,  and  the  gate  condition  is  taken  to  be  that  both 
inputs  to  an  and-gate  must  occur  at  once.  Three  different  formalisations  of  gates 
are  given,  corresponding  to  different  stances  on  the  issue  of  gate  causality. 

Formally, fault trees are interpreted as formulas of temporal logic. We use
the modal mu-calculus (see Appendix A), but nearly all temporal logics are
expressive enough for our purposes. Similarly, the kinds of structures that temporal
logics are interpreted over are very general. We assume only that a system model
can be represented as a transition system or as a set of sequences of states.

Events are formalised as atomic propositions, which are interpreted as sets of
states. For example, the event "sensor failure" could be modelled as the atomic
proposition SF, which is interpreted as all states in which the sensor has failed.
This formalisation of events fits with most of the examples of the Fault Tree
Handbook, and is consistent with the meaning of the term "event" in probability
theory. Since fault trees are subject to probabilistic analysis, a consistent view of
events is desirable.

Next we will formalise the meaning of gates. We will let $+(in_1, in_2, out)$ stand
for an or-gate with inputs $in_1$ and $in_2$ and output $out$. Similarly,
$\bullet(in_1, in_2, out)$ stands for an and-gate. The semantics of a gate $g$, denoted
$[\![g]\!]$, gives the logical relationship between the input and output events of $g$.

3.1 A Propositional Semantics for Gates

Formalising gates with propositional logic is a simple approach that is reasonably
close to the informal description of gates in the Fault Tree Handbook. In terms
of the issues just discussed, this interpretation requires and-gate inputs to occur
at the same time for the gate condition to be satisfied, and takes causality to be
necessary, sufficient, and immediate. The subscript $p$ on the semantic function
stands for "propositional".

$$[\![+(in_1, in_2, out)]\!]_p \;\hat{=}\; out \Leftrightarrow in_1 \lor in_2$$
$$[\![\bullet(in_1, in_2, out)]\!]_p \;\hat{=}\; out \Leftrightarrow in_1 \land in_2$$

Informally, the second statement says that the output of an and-gate is true
exactly when both inputs are true. Remembering that events are treated as sets of
states, the statement alternatively says that the set of states denoted by $out$ is
the intersection of the sets denoted by $in_1$ and $in_2$. The concept of causality here
is truly immediate: whenever both causes are present the effect is also present.
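
The reading of events as sets of states can be checked mechanically on a toy example. The sketch below is illustrative only; the helper names (`all_states`, `denotation`, `or_gate_p`, `and_gate_p`, `model_states`) are hypothetical and a state is simply assumed to be an assignment of truth values to event names.

```python
from itertools import product


def all_states(events):
    """Every assignment of truth values to the given event names."""
    return [dict(zip(events, bits))
            for bits in product([False, True], repeat=len(events))]


def denotation(event, states):
    """Indices of the states in which the event holds."""
    return {k for k, s in enumerate(states) if s[event]}


def or_gate_p(in1, in2, out):
    """Propositional or-gate: out <=> in1 or in2, evaluated in one state."""
    return lambda s: s[out] == (s[in1] or s[in2])


def and_gate_p(in1, in2, out):
    """Propositional and-gate: out <=> in1 and in2, evaluated in one state."""
    return lambda s: s[out] == (s[in1] and s[in2])


def model_states(gates, events):
    """States consistent with every gate condition."""
    return [s for s in all_states(events) if all(g(s) for g in gates)]
```

For a single and-gate, `denotation("out", states)` computed over `model_states([and_gate_p("in1", "in2", "out")], ["in1", "in2", "out"])` equals the intersection of the denotations of the two inputs; for an or-gate it equals their union.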


3.2  Two  Temporal  Semantics  for  Gates 

The  greatest  weakness  of  the  propositional  interpretation  of  fault  trees  is  the 
assumption  that  no  time  can  pass  between  cause  and  effect.  This  assumption 
violates  a  common  intuition  about  causality.  Since  the  examples  in  the  Fault  Tree 
Handbook  mostly  concern  examples  in  which  flow  is  virtually  instantaneous  (as 
in  an  electric  circuit),  the  problem  rarely  arises  there.  In  cases  where  flow  is  not 
instantaneous,  events  are  modelled  so  that  causes  can  be  made  immediate,  albeit 
somewhat  unnaturally.  For  example,  in  the  pressure  tank  analysis  of  Chapter  VII 
continuous  pump  operation  can  lead  to  a  pump  failure.  This  cause  is  modelled 
as  the  event  “tank  ruptures  due  to  internal  over-pressure  caused  by  continuous 
pump  operation  for  t  >  60  sec”.  Since  the  idea  of  a  cause  leading  to  an  event  is 
natural,  it  is  worthwhile  to  try  to  view  fault  trees  in  this  way. 

Our first temporal semantics requires that and-gate inputs occur at the same
time to satisfy the gate condition, and takes causality to be only sufficient, not
necessary or immediate. This means that once the gate condition is satisfied, the
gate output must eventually occur. The temporal logic operator $even$ is used
to express the temporal condition of eventuality. Thus $even(\phi)$ means that the
property expressed by formula $\phi$ will hold in the future.

The temporal relation between input and output events for gates can be
defined as

$$[\![+(in_1, in_2, out)]\!]_{t_1} \;\hat{=}\; (in_1 \lor in_2) \Rightarrow even(out)$$
$$[\![\bullet(in_1, in_2, out)]\!]_{t_1} \;\hat{=}\; (in_1 \land in_2) \Rightarrow even(out)$$

The second definition says that it is always the case that if input events $in_1$ and
$in_2$ occur together, then eventually output event $out$ will occur.

Our second temporal semantics treats causality as only necessary. The temporal
operator $prev(\phi)$ means that the property expressed by formula $\phi$ held in
the past.

$$[\![+(in_1, in_2, out)]\!]_{t_2} \;\hat{=}\; out \Rightarrow prev(in_1 \lor in_2)$$
$$[\![\bullet(in_1, in_2, out)]\!]_{t_2} \;\hat{=}\; out \Rightarrow prev(in_1 \land in_2)$$

However, these definitions allow the gate output $out$ to occur many times
for a single occurrence of $in_1 \land in_2$. A better interpretation might require that
if $out$ happens, then $in_1 \land in_2$ must have happened at least as recently as the
previous occurrence of $out$.
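
To see how the two temporal readings differ, they can be evaluated over a finite trace. The sketch below makes several assumptions that are not in the text: a trace is a finite list of sets of event names, "the future" and "the past" both include the present position, and all helper names are hypothetical. The paper itself works in the modal mu-calculus over general transition systems.

```python
def atom(name):
    """The event `name` holds at position k of the trace."""
    return lambda trace, k: name in trace[k]


def both(p, q):
    return lambda trace, k: p(trace, k) and q(trace, k)


def implies(p, q):
    return lambda trace, k: (not p(trace, k)) or q(trace, k)


def even(phi):
    """phi holds now or at some later position (finite-trace reading)."""
    return lambda trace, k: any(phi(trace, j) for j in range(k, len(trace)))


def prev(phi):
    """phi holds now or at some earlier position."""
    return lambda trace, k: any(phi(trace, j) for j in range(k + 1))


def always(phi):
    """phi holds at every position of the trace."""
    return lambda trace: all(phi(trace, k) for k in range(len(trace)))


def and_gate_t1(in1, in2, out):
    """Sufficient cause: (in1 and in2) => even(out)."""
    return implies(both(atom(in1), atom(in2)), even(atom(out)))


def and_gate_t2(in1, in2, out):
    """Necessary cause: out => prev(in1 and in2)."""
    return implies(atom(out), prev(both(atom(in1), atom(in2))))


# Both inputs coincide at position 1; the output appears later, and twice.
TRACE = [{"in1"}, {"in1", "in2"}, set(), {"out"}, {"out"}]
```

`TRACE` satisfies both temporal readings of the and-gate, although the output occurs twice for a single joint occurrence of the inputs, which is the looseness noted above. A trace consisting only of `{"out"}` satisfies the sufficient reading vacuously but violates the necessary one.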

There are other possible interpretations based on other choices about the
basic semantic issues. For example, combining the two temporal semantics we
have presented would give one modelling sufficient and necessary causality.

Fault  tree  gates  have  been  interpreted  temporally  before  (see  [1]),  but  the 
use  of  temporal  logic  here  allows  much  simpler  semantics.  This  simplicity  makes 
comparison  between  alternative  interpretations  easier. 

3.3  Putting  Gates  Together 

We now present the semantics of a fault tree $t$ based on the set of gates contained
in the tree (written as $gates(t)$). We use the temporal operator $always(\phi)$, which
means that the property expressed by $\phi$ holds in every state.

$$[\![t]\!] \;\hat{=}\; always\Bigl(\bigwedge_{g \in gates(t)} [\![g]\!]\Bigr)$$

In English, this definition says that it is always the case that every gate
condition is satisfied. Note that the meaning of a fault tree is given in terms of
the meaning of its gates.

The  propositional  semantics  has  some  great  advantages  over  the  temporal 
ones.  Because  a  gate  output  is  defined  in  the  propositional  case  to  be  logically 
equivalent  to  the  disjunction  or  conjunction  of  its  inputs,  the  fault  tree  can  be 
manipulated  according  to  the  laws  of  propositional  logic.  This  property  allows 
internal  events  of  a  fault  tree  to  be  removed  by  simplification,  giving  a  relation 
between  only  the  primary  failures  and  the  system  fault  (as  is  found  in  minimal 
cut  set  interpretations  of  fault  trees  [5]). 
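
The elimination of internal events under the propositional semantics can be sketched as a simple expansion into cut sets. The code below is a minimal illustration, not the procedure of the Fault Tree Handbook; the tree `TREE` and its event names are invented for the example (they are not the tree of the boiler figure), and the function names are hypothetical.

```python
from itertools import product


def cut_sets(event, gates):
    """Sets of basic events that are jointly sufficient for `event`."""
    if event not in gates:  # a basic event is its own cut set
        return {frozenset([event])}
    kind, inputs = gates[event]
    child_sets = [cut_sets(i, gates) for i in inputs]
    if kind == "or":
        # any cut set of any input is a cut set of the output
        return set().union(*child_sets)
    # and-gate: choose one cut set per input and merge them
    return {frozenset().union(*combo) for combo in product(*child_sets)}


def minimal(sets):
    """Discard every cut set that strictly contains another one."""
    return {s for s in sets if not any(t < s for t in sets)}


# Hypothetical tree: each entry maps an output event to (gate kind, inputs).
TREE = {
    "level_too_high": ("or", ["pump_stuck_on", "no_shutoff_signal"]),
    "no_shutoff_signal": ("and", ["sensor_1_fails", "sensor_2_fails"]),
}
```

For this invented tree, `minimal(cut_sets("level_too_high", TREE))` relates the system fault directly to the basic events, with the intermediate event `no_shutoff_signal` removed.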

A  further  advantage  of  the  propositional  interpretation  of  fault  trees  is  that
the  meaning  is  given  as  an  invariant  property  -  a  property  that  can  be  checked  by 
looking  at  states  in  isolation.  Invariant  properties  are  an  easy  class  of  temporal 
logic  formulas  to  prove. 

The main advantage of the temporal semantics is their ability to model richer
notions of causality. Unfortunately, it is no longer possible to eliminate internal
events by simplification, and thus minimal cut sets cannot generally be obtained.
Furthermore, this formalisation of fault trees uses the temporal property of
eventuality, and is therefore a liveness property. This class of temporal logic
formulas is generally more difficult to prove than invariant formulas.

The best interpretation of a fault tree probably depends on the system being
studied. In some cases one might want to choose different interpretations for
different kinds of gates. For example, or-gates could be interpreted propositionally
and and-gates interpreted temporally. Alternatively, a wider variety of gate
types could be defined, and their use mixed in a single fault tree.

The  material  in  the  next  section  can  be  applied  independently  of  choice  of 
semantics  for  fault  trees. 

4  Relating  Fault  Trees  to  System  Models

The  last  section  showed  that  a  fault  tree  expresses  a  property  of  failure  events  in 
a  system.  We  might  therefore  expect  a  model  of  the  system  to  have  the  properties 
expressed  by  its  fault  trees.  We  will  attempt  to  make  this  relationship  precise. 

Let $\mathcal{F}$ stand for the set of system faults for which fault trees have been
developed, and let $ft(F)$ be the fault trees for fault $F$. Given a system model $M$
with an initial state $s_0$, we write $s_0 \models_M \phi$ if the system model has the property
expressed by formula $\phi$. The condition expressing that a model $M$ of a system
is consistent with the set of fault trees for the system is

$$s_0 \models_M \bigwedge_{F \in \mathcal{F}} ft(F)$$

This  condition  is  too  strong,  however,  because  usually  a  system  model  will 
capture  only  certain  aspects  of  a  system.  One  way  to  weaken  the  relation  above 
is  to  require  a  system  model  to  satisfy  the  property  expressed  by  a  fault  tree  only 
if the system fault of the tree is found in the system model. Letting $faults(M)$
stand for the system faults in a model $M$, the new consistency condition is

$$s_0 \models_M \bigwedge_{F \in \mathcal{F} \cap faults(M)} ft(F)$$

This  relation  is  still  quite  strong,  however.  If  a  system  model  only  captures 
certain  failures,  then  it  probably  would  not  satisfy  this  condition.  It  would  be 
useful to know the weakest relation that should definitely be expected to hold
between a model of a system and the fault trees of a system. Our approach is
to assume that we know nothing about events not given in a system model. As
an example, suppose that we have a single or-gate, +(B, C, A), which by the
propositional interpretation gives the relation $A \Leftrightarrow B \lor C$ between events A, B,
and C. Also suppose that we know nothing about event B. Then we will still
expect that $C \Rightarrow A$. Logically this amounts to the projection of the relation
$A \Leftrightarrow B \lor C$ onto the atomic propositions A and C. The projected relation is
arrived at by taking the disjunction of the cases where B is true and B is false.
In other words, the disjunction of $A \Leftrightarrow true \lor C$ and $A \Leftrightarrow false \lor C$ is equivalent
to the formula $C \Rightarrow A$. In the general case, where more than one event might
be missing, we need to consider all combinations of possibilities for the missing
events.

To formalise this idea, let $\phi[\psi/Q]$ be the formula $\phi$ with every occurrence of
atomic proposition $Q$ within $\phi$ replaced by $\psi$. For example, $A \lor B[false/B]$ gives
$A \lor false$. For multiple substitutions, let $\phi[\psi_1/Q_1, \ldots, \psi_n/Q_n]$ be the formula
$\phi$ with occurrences of $Q_1, \ldots, Q_n$ in $\phi$ simultaneously replaced by $\psi_1, \ldots, \psi_n$
(no atomic proposition is allowed to occur twice in the substitution list). For
example, $A \lor B \Leftrightarrow C[true/A, false/C]$ gives $true \lor B \Leftrightarrow false$. We will write
$S_1 \times S_2$ for the cross product of sets $S_1$ and $S_2$, i.e.,
$S_1 \times S_2 \stackrel{\mathrm{def}}{=} \{(x, y) \mid x \in S_1 \text{ and } y \in S_2\}$.

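Let $Bool$ be the set $\{true, false\}$ of boolean constants, and let $Bool^n$ be the
n-fold product of $Bool$. The interpretation of a fault tree in the absence of a set
of events $S = \{a_1, \ldots, a_n\}$ is defined to be:

$$\phi - S \;\stackrel{\mathrm{def}}{=}\; \bigvee_{(b_1, \ldots, b_n) \in Bool^n} \phi[b_1/a_1, \ldots, b_n/a_n]$$

where $\phi$ is the formula expressed by the fault tree.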

Let $events(M)$ be the set of events in the system model $M$, and let $events(t)$
be the set of events in fault tree $t$. Then $events(t) \setminus events(M)$ is the set of events
found in the fault tree $t$ but not the model $M$. The condition expressing that a
model $M$ of a system is consistent with the set of fault trees for the system is
now

$$s_0 \models_M \bigwedge_{F \in \mathcal{F} \cap faults(M)} ft(F) - (events(ft(F)) \setminus events(M))$$
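The projection used in this condition can be illustrated with a small sketch. It is not part of the paper: it assumes the propositional interpretation, represents a formula as a Python predicate over a dictionary of event truth values, and the names `project` and `or_gate` are hypothetical. It reproduces the or-gate example, where projecting away B leaves C => A.

```python
from itertools import product

def project(formula, missing):
    """Disjunction of `formula` over all truth values of the `missing` events."""
    def projected(env):
        return any(
            formula({**env, **dict(zip(missing, values))})
            for values in product([True, False], repeat=len(missing))
        )
    return projected

# Or-gate +(B, C, A) under the propositional interpretation: A <=> (B or C).
def or_gate(env):
    return env["A"] == (env["B"] or env["C"])

# Project away B, an event about which the system model says nothing.
weakened = project(or_gate, ["B"])

# The projected relation agrees with C => A on every assignment of A and C.
for a, c in product([True, False], repeat=2):
    assert weakened({"A": a, "C": c}) == ((not c) or a)
```

With n missing events the disjunction ranges over 2^n assignments, so this brute-force form is only practical for small fault trees.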


5  Conclusions 

This  paper  contains  three  contributions  to  the  study  of  safety-critical  systems. 
First,  it  presents  the  idea  that  fault  trees  can  be  used  to  check  the  validity 
of  safety-critical  system  models.  Second,  it  contains  three  formal  semantics  for 
fault  trees.  These  semantics  are  an  improvement  on  earlier  work  by  expressing 
the  meaning  of  fault  trees  with  temporal  logic,  by  expressing  events  as  sets  of 
states,  and  by  identifying  four  elements  of  the  meaning  of  gates:  gate  condition, 
sufficiency,  necessity,  and  immediacy.  Finally,  the  paper  defines  a  consistency 
condition  between  a  model  of  a  system  and  the  system’s  fault  trees  that  works 
even for models that contain only some of the failure events in their fault
trees. 

Tool  support  for  checking  the  consistency  condition  exists  in  the  form  of 
model  checkers,  which  automatically  show  whether  a  finite-state  model  satisfies 
a  temporal  logic  formula  [3].  Proof  tools  (such  as  [2])  are  available  in  case  the 
model  is  not  finite-state. 

The  work  described  here  should  be  regarded  as  a  first  step  towards  a  complete 
understanding  of  fault  trees  and  their  relation  to  system  models.  As  mentioned 
in  the  section  on  the  semantics  of  gates,  the  formalisation  here  of  necessary 
causes  may  be  too  simplistic.  The  consistency  condition  given  might  need  to 
be  strengthened  to  ensure  that  an  event  representing  a  component  failure  can 
always occur provided it has not already occurred. Our consistency condition
[...] handle the case in which a single failure event in a system model represents
several failure events in a fault tree.

A  A  Temporal  Logic 

We  use  an  extended  form  of  the  modal  mu-calculus  [4,  6]  as  a  temporal  logic 
to  express  behavioural  properties.  The  syntax  of  the  extended  mu-calculus  is  as 
follows, where $L$ ranges over sets of actions, $Q$ ranges over atomic sentences, and
$Z$ ranges over variables:

$$\phi ::= Q \mid \neg\phi \mid \phi_1 \land \phi_2 \mid [L]\phi \mid Z \mid \nu Z.\phi$$

The operator $\nu Z$ binds free occurrences of $Z$ in $\phi$, with the syntactic restriction
that free occurrences of $Z$ in a formula $\phi$ lie within an even number of
negations.

Let $S$ be a set of states and $Act$ a set of actions. A formula $\phi$ is interpreted
as the set $\|\phi\|_V$ of states, defined relative to a fixed transition system
$T = (S, \{\xrightarrow{a} \mid a \in Act\})$ and a valuation $V$, which maps variables to sets of states.
The notation $V[S'/Z]$ stands for the valuation $V'$ which agrees with $V$ except
that $V'(Z) = S'$. Since the transition system is fixed we usually drop the state
set and write simply $\|\phi\|_V$. The definition of $\|\phi\|_V$ is as follows:

$$\begin{aligned}
\|Q\|_V &= V(Q) \\
\|\neg\phi\|_V &= S - \|\phi\|_V \\
\|\phi_1 \land \phi_2\|_V &= \|\phi_1\|_V \cap \|\phi_2\|_V \\
\|[L]\phi\|_V &= \{ s \in S \mid \text{if } s \xrightarrow{a} s' \text{ and } a \in L \text{ then } s' \in \|\phi\|_V \} \\
\|Z\|_V &= V(Z) \\
\|\nu Z.\phi\|_V &= \bigcup \{ S' \subseteq S \mid S' \subseteq \|\phi\|_{V[S'/Z]} \}
\end{aligned}$$

A state $s$ satisfies a formula $\phi$ relative to a model $M = (T, V)$, written
$s \models_M \phi$, if $s \in \|\phi\|_V$.

Informally, $[L]\phi$ holds of a state $s$ if $\phi$ holds for all states $s'$ that can be reached
from $s$ through an action $a$ in $L$. A fixed point formula can be understood by
keeping in mind that $\nu Z.\phi$ can be replaced by its "unfolding": the formula $\phi$ with
$Z$ replaced by $\nu Z.\phi$ itself. Thus, $\nu Z.\psi \land [\{a\}]Z = \psi \land [\{a\}](\nu Z.\psi \land [\{a\}]Z) =
\psi \land [\{a\}](\psi \land [\{a\}](\nu Z.\psi \land [\{a\}]Z)) = \ldots$ holds of any process for which $\psi$ holds
along any execution path of $a$ actions.
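For a finite-state system the unfolding view corresponds to a simple iteration: start from the set of all states and apply the body of the fixed point until nothing changes. The sketch below is an illustration only, not the algorithm of any particular model checker. The transition system, the state set `psi`, and the function names are hypothetical.

```python
def box(states, trans, labels, target):
    """States whose successors under actions in `labels` all lie in `target` ([L] modality)."""
    return {s for s in states
            if all(t in target for (u, a, t) in trans if u == s and a in labels)}

def gfp(states, step):
    """Greatest fixed point of a monotone function on subsets of a finite state set."""
    current = set(states)
    while True:
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt

def along_all_paths(states, trans, labels, psi):
    """Semantics of nu Z. psi /\\ [labels]Z, with psi given as a set of states."""
    return gfp(states, lambda z: psi & box(states, trans, labels, z))

# Hypothetical system: 0 and 1 form an a-cycle, while 2 can step to 3 where psi fails.
states = {0, 1, 2, 3}
trans = {(0, "a", 1), (1, "a", 0), (2, "a", 3), (3, "a", 3)}
psi = {0, 1, 2}
result = along_all_paths(states, trans, {"a"}, psi)
```

Here `result` is `{0, 1}`: state 2 satisfies `psi` itself but can reach state 3, so it is removed in the second iteration.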

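The operators $\lor$, $\langle L \rangle$, and $\mu Z$ are defined as duals to existing operators (where
$\phi[\psi/Z]$ is the property obtained by substituting $\psi$ for free occurrences of $Z$ in
$\phi$):

$$\begin{aligned}
\phi_1 \lor \phi_2 &\stackrel{\mathrm{def}}{=} \neg(\neg\phi_1 \land \neg\phi_2) \\
\langle L \rangle \phi &\stackrel{\mathrm{def}}{=} \neg[L]\neg\phi \\
\mu Z.\phi &\stackrel{\mathrm{def}}{=} \neg\nu Z.\neg\phi[\neg Z/Z]
\end{aligned}$$

These additional basic abbreviations are also convenient:

$$\begin{aligned}
[-]\phi &\stackrel{\mathrm{def}}{=} [Act]\phi \\
true &\stackrel{\mathrm{def}}{=} \nu Z.Z \\
false &\stackrel{\mathrm{def}}{=} \neg true
\end{aligned}$$

Common operators of temporal logic can also be defined as abbreviations:

$$\begin{aligned}
always(\phi) &\stackrel{\mathrm{def}}{=} \nu Z.\phi \land [-]Z \\
even(\phi) &\stackrel{\mathrm{def}}{=} \mu Z.\phi \lor (\langle - \rangle true \land [-]Z)
\end{aligned}$$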

To define a previously operator, a reverse modal operator $[\overline{L}]$ must be added
to the logic.

$$\|[\overline{L}]\phi\|_V = \{ s \in S \mid \text{if } s' \xrightarrow{a} s \text{ and } a \in L \text{ then } s' \in \|\phi\|_V \}$$

The previously operator is just the reverse version of even:

$$prev(\phi) \stackrel{\mathrm{def}}{=} \mu Z.\phi \lor (\langle \overline{-} \rangle true \land [\overline{-}]Z)$$

Abstract. In this paper we illustrate, by way of examples, composition,
analysis and refinement of systems modelled by means of probabilistic
automata behaving as Markov chains over discrete time. For a formalised
and general treatment of these ideas the reader is referred to [6].

1  Introduction 

Dependability  analysis  of  failure-prone  real-time  systems  or  performance  anal¬ 
ysis  of  real-time  systems  interacting  with  stochastic  environments  is  frequently 
based  on  the  use  of  Probabilistic  Automata,  PA’s,  with  Markov  properties  as 
computation  models. 

A  recent  formalism  using  this  approach  is  Probabilistic  Duration  Calculus 
PDC  [5],  an  extension  of  Duration  Calculus,  DC,  which  in  turn  was  developed 
for  specification  and  verification  of  embedded  real-time  systems  [7,  1]. 

For a given discrete-time PA representing a design, PDC makes it possible to
calculate and reason about the probability that the PA satisfies a DC-formula
(expressing a requirement or design decision) during the first t time units.

In  this  paper  we  consider  parallel  composition  of  component  PA’s  into  larger 
component  PA’s  or  into  a  system  PA.  Each  component  PA  may  depend  on  states 
in  the  other  components,  (the  PA  is  then  said  to  be  open),  but  the  system  PA 
is  independent  of  external  states  (the  PA  is  then  said  to  be  closed).  Closedness 
is  a  condition  for  analysis  by  means  of  PDC. 

We  also  consider  probabilistic  refinement  with  respect  to  a  DC  formula.  This 
means  that  with  two  consecutive  system  designs,  if  for  all  t  >  0  the  second  design 
satisfies  the  requirement  with  higher  probability  than  the  first  design,  then  the 
second  design  is  said  to  refine  the  first  design  with  respect  to  the  requirement. 
Simple  examples  of  probabilistic  refinement  are  included. 

Compositionality has also been treated in probabilistic extensions of CSP-
and CCS-like process algebras [4, 2], but none of these approaches covers the
dependencies between components referred to above. A notion of probabilistic
refinement different from ours is described in [3]. In that work probabilistic
specifications prescribe permissible intervals for the target probabilities, and
refinement refers to the narrowing of these intervals in subsequent specifications.

* This research was supported by the Danish Technical Research Council under project
Co-design. The research of Zhiming Liu was also supported in part by research grant
GR/H39499 from the Science and Engineering Research Council of UK.


2  Probabilistic  Automata  Over  Discrete  Time 

The  behaviour  of  probabilistic  systems  having  a  finite  number  of  states  is  con¬ 
veniently  modelled  by  means  of  finite  probabilistic  automata  over  discrete  time. 
These  automata  are  defined  by  the  set  of  states,  the  set  of  initial  state  proba¬ 
bilities  and  the  set  of  transition  probabilities  per  time  unit. 

As a running example we consider a Gas Burner consisting of a Burner (an
abstraction of the gas-valve, the ignition device and the control box) and a
Detector (an abstraction of the mechanism for detection of unburnt gas). We
assume that the gas is turned on at $t = 0$ and remains on.

2.1  States 

The Burner and Detector components are characterised by disjoint sets of primitive
Boolean states. For simplicity these sets, denoted $A_b$ and $A_d$ respectively,
are assumed to be the singleton sets

$A_b = \{Flame\}$ and $A_d = \{Act\}$

where Flame asserts that the flame exists and Act asserts that the Detector is
able to detect unburnt gas. (If the gas was not permanently on, we would have
to define $A_b$ as $\{Gas, Flame\}$, where the additional primitive state Gas asserts
that gas is released.)

The sets of component states $S_b$ and $S_d$ are defined as subsets of the sets of
minterms over $A_b$ and $A_d$ respectively:

$S_b = \{\neg Flame, Flame\} \subseteq 2^{A_b}$ and $S_d = \{\neg Act, Act\} \subseteq 2^{A_d}$

(where, in this case, all minterms are possible states). Accordingly the Burner
makes transitions between the states $\neg Flame$ and $Flame$ (corresponding to
alternation between successful flame ignitions and unintended flame extinctions)
while the Detector makes transitions between $Act$ and $\neg Act$ (corresponding to
alternation between failure and repair).

For  the  composed  system  we  have: 

$A = A_b \cup A_d = \{Flame, Act\}$

$S = S_b \times S_d = \{\neg Flame \land Act,\; Flame \land Act,\; \neg Flame \land \neg Act,\; Flame \land \neg Act\} \subseteq 2^{A}$

2.2 Dependency on External States

In  a  composite  system  a  component  can  only  change  its  own  primitive  states. 
However, the local transition probabilities may depend on external states, i.e.
states which are local to other components in the composition.

This is exemplified by the Burner. For example, given that the Burner is in
state $\neg Flame$ at time $t$, the probability that it will be in state $Flame$ at time
$t + 1$ is zero if the Detector is non-active and non-zero if it is active. In the latter
case  the  actual  value  depends  on  the  quality  of  the  ignition  mechanism  in  the 
Burner. 

2.3  Open  and  Closed  Probabilistic  Automata 

The  previous  discussion  suggests  that  a  component,  which  depends  upon  envi¬ 
ronmental  states,  should  be  modelled  by  a  collection  of  sub-models,  one  for  each 
environmental  state.  This  is  illustrated  by  the  transition  graphs  for  the  Burner 
automaton  in  Figure  la  according  to  which: 

-  The Burner starts from state $\neg Flame$ with probability $p_1 = 1$.

-  With the Detector in state $Act$ the Burner behaves as follows:

•  Given that it is in state $\neg Flame$, it remains in that state with probability
$p_{11}$ per time unit or it goes to state $Flame$ with probability
$p_{12}$ per time unit, where $p_{11} + p_{12} = 1$.

•  Given that it is in state $Flame$, it remains in that state with probability
$p_{22}$ per time unit or it goes to state $\neg Flame$ with probability
$p_{21}$ per time unit, where $p_{22} + p_{21} = 1$.

-  With the Detector in state $\neg Act$ the Burner behaves as follows:

•  Given that it is in state $\neg Flame$, it remains in that state with probability
$p_{11} = 1$ per time unit. This implies that $p_{12} = 0$, i.e. the Burner
can never go to state $Flame$.

•  Given that it is in state $Flame$, it remains in that state with probability
$p_{22}$ per time unit or it goes to state $\neg Flame$ with probability
$p_{21}$ per time unit, where $p_{22}$ and $p_{21}$ are the same as when the detector
is in state $Act$.

$p_{11}$ and $p_{12}$ depend on the detector state because they characterise the ability
of the Burner to establish Flame. In contrast to this, $p_{22}$ and $p_{21}$ are entirely
independent of the detector state because they characterise the stability of the
flame.

A  PA  in  which  some  transition  probabilities  depend  on  external  states  will 
be  called  an  open  PA. 

For the Detector automaton we assume the transition graph shown in Figure
1b according to which:

-  The Detector starts from state $Act$ with probability $q_1$ or from state $\neg Act$
with probability $q_2$ ($= 1 - q_1$).

-  Given that it is in state $Act$, it remains in that state with probability $q_{11}$
per time unit or it goes to state $\neg Act$ with probability $q_{12}$ per time unit,
where $q_{11} + q_{12} = 1$.

-  Given that it is in state $\neg Act$, it remains in that state with probability
$q_{22}$ per time unit or it goes to state $Act$ with probability $q_{21}$ per time unit,
where $q_{22} + q_{21} = 1$.

The transition probabilities of the Detector are independent of the Burner state.
This  reflects  that  the  probabilities  of  failure  or  repair  of  the  Detector  are  con¬ 
sidered  to  be  unaffected  by  flame-  or  ignition  failures  occurring  in  the  Burner. 

A  PA  in  which  no  transition  probabilities  depend  on  external  states  will  be 
called  a  closed  PA. 


Fig. 1. The probabilistic component automata: Burner and Detector.
(a: Burner, an open automaton, with one transition graph for condition Act and one for condition ¬Act; b: Detector.)


3  Parallel  Composition  of  PA’s 

The PA for the Gas Burner is determined by parallel composition of the PA's
for  the  Burner  and  the  Detector. 

Gas-Burner  =  Burner  ||  Detector 

This  operation  is  fully  formalised  and  generalised  in  [6].  The  resulting  PA  for  the 
Gas  Burner  is  shown  in  Figure  2,  (where  the  state  numbering  is  arbitrary  and 
introduced  for  later  use).  The  reasoning  behind  this  construction  is  illustrated 
informally by means of a few examples.

The probability that the Gas Burner starts in, say, state 3: $\neg Flame \land \neg Act$
is the product of the probability that the Burner starts in state $\neg Flame$ and the
probability that the Detector starts in state $\neg Act$. According to Figure 1 this
product is $p_1 \cdot q_2 = 1 \cdot q_2 = q_2$. Similarly we find that the initial probabilities
of states 1, 2 and 4 are $q_1$, 0 and 0 respectively.

The probability that the Gas Burner transits from e.g. state 1: $\neg Flame \land Act$
to state 2: $Flame \land Act$ within one time unit is the product of the probability that
the Burner transits from $\neg Flame$ to $Flame$, given that the Detector is active,
and the probability that the Detector transits from $Act$ to $Act$, both within
one time unit. This product is $p_{12} \cdot q_{11}$. On the other hand, the probability
that the Gas Burner transits from state 3 to state 4 is zero because it requires
a transition of the Burner from $\neg Flame$ to $Flame$ while the Detector is non-active,
but according to Figure 1 this is impossible.

The  other  composite  transition  probabilities  can  be  determined  in  a  similar 
way. 
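The construction can be sketched in a few lines of Python. This is an informal illustration of the reasoning above, not the formal operator of [6]. The numeric probabilities are hypothetical (the paper leaves them symbolic), and the composite states are listed in the numbering used in the text: 1 is ¬Flame ∧ Act, 2 is Flame ∧ Act, 3 is ¬Flame ∧ ¬Act, 4 is Flame ∧ ¬Act.

```python
from fractions import Fraction as F

# Hypothetical values; the paper keeps p_ij and q_ij symbolic.
p12, p21 = F(9, 10), F(1, 100)
q1, q12, q21 = F(99, 100), F(1, 1000), F(1, 2)

# Burner (open PA): one transition table per Detector condition.
# Burner state order: index 0 = not Flame, index 1 = Flame.
burner = {
    True:  [[1 - p12, p12], [p21, 1 - p21]],   # Detector active
    False: [[F(1), F(0)], [p21, 1 - p21]],     # Detector non-active: no ignition
}
burner_init = [F(1), F(0)]

# Detector (closed PA). State order: index 0 = Act, index 1 = not Act.
detector = [[1 - q12, q12], [q21, 1 - q21]]
detector_init = [q1, 1 - q1]

# Composite states as (burner index, detector index), in the order 1, 2, 3, 4.
composite = [(0, 0), (1, 0), (0, 1), (1, 1)]

def compose():
    """Initial vector and transition matrix of Burner || Detector."""
    init = [burner_init[f] * detector_init[d] for f, d in composite]
    matrix = [
        # The Burner row is selected by the Detector's current state (d == 0 means Act).
        [burner[d == 0][f][f2] * detector[d][d2] for f2, d2 in composite]
        for f, d in composite
    ]
    return init, matrix

p, P = compose()
```

Each row of `P` sums to 1, and no entry depends on a state outside the composite, which is why the result is a closed PA.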


Fig. 2. The closed automaton Gas-Burner = Burner || Detector

We observe that even though the Burner PA is open, the composition of the
Burner and the Detector PA's is a closed PA. This illustrates that when we
compose two component PA's the dependencies of one PA on primitive states
in the other will be hidden in the resulting PA¹.

¹ This resembles the hiding of a communication between two processes in parallel composition.

In [6] the formalisation of such constructions is based on a representation of
a PA as a tuple $(S, C, \tau_0, \tau)$ where $S$ is the set of states, $C$ is a set of primitive
external states which is sufficient for definition of openness, $\tau_0$ is the initial
probability function and $\tau$ is the transition probability function, which is parameterised
by the elements in the condition space $2^C$. A PA is closed if no transition
probability depends on the conditions in $2^C$. $C$ can then be eliminated from the
tuple.

The parallel operator || constructs a well formed tuple $(S_c, C_c, \tau_{0c}, \tau_c)$ from
two well formed tuples $(S_a, C_a, \tau_{0a}, \tau_a)$ and $(S_b, C_b, \tau_{0b}, \tau_b)$ such that

$S_c = S_a \times S_b \subseteq 2^{A_a \cup A_b}$ and $C_c = (C_a \setminus A_b) \cup (C_b \setminus A_a)$

where $A_a$ and $A_b$ are the sets of primitive states for components a and b.

4  Requirements  and  Satisfaction  Probabilities 

The  Gas  Burner  has  critical  states  characterised  by  release  of  gas  while  the  flame 
is absent. The disjunction of these states is a state called Leak. Since the gas
is permanently on in our example, Leak is identical to $\neg Flame$, which is the
disjunction of states 1 and 3 in Figure 2.

One  of  the  design  decisions  could  be  that  whenever  Leak  occurs,  it  should 
be  detected  and  eliminated  within  one  time  unit.  In  [7]  this  constraint,  called 
Des-1, is expressed as the following formula in Duration Calculus:

Des-1: $\Box(\lceil Leak \rceil \Rightarrow \ell \le 1)$

This formula reads:

$\Box$  "For any subinterval of the observation interval,"

$\lceil Leak \rceil \Rightarrow$  "if there is Leak in that subinterval then"

$\ell \le 1$  "its length should not exceed one time unit".

Duration  Calculus,  DC,  is  an  interval  logic  for  the  interpretation  of  Boolean 
states  over  time  (a  logic  for  timing  diagrams).  Its  distinctive  feature  is  reasoning 
about  durations  of  states  within  any  time  interval  without  explicit  reference  to 
absolute time. It is used to specify and verify real-time requirements. The reader
is referred to [7, 1] for further details.
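Over discrete time, the reading of Des-1 given above can be sketched as a check on a finite behaviour. The sketch below is an illustration only: it assumes a behaviour is a list of composite state numbers, each occupying one time unit, and the function name `satisfies_des1` is hypothetical. States 1 and 3 are the Leak states, following the numbering of Figure 2.

```python
def satisfies_des1(behaviour, leak_states=frozenset({1, 3})):
    """True if no unbroken run of Leak states lasts longer than one time unit."""
    run = 0
    for state in behaviour:
        # Extend the current Leak run, or reset it when the flame is present.
        run = run + 1 if state in leak_states else 0
        if run > 1:
            return False
    return True
```

For example, `satisfies_des1([1, 2, 2, 3, 4])` holds because each Leak lasts a single time unit, while `satisfies_des1([1, 3, 2])` does not, since Leak persists across the transition from state 1 to state 3.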

For  a  real  design  with  failure-prone  components  we  can  not  in  general  expect 
a  duration  formula  (expressing  some  requirement)  to  hold  for  all  times.  The 
question  is  then:  does  it  hold  with  sufficiently  high  probability  over  a  specified 
observation  interval  [0,  t].  This  question  is  answered  as  follows. 

Let  G  denote  the  closed  PA  modelling  the  design  and  D  denote  the  formula. 
Then  we  must  compute  the  probability  that  G  satisfies  D  in  the  time  interval 
[0,t]. This probability is called the satisfaction probability of D by G and is
denoted $\mu_G(D)[t]$.

Satisfaction probabilities are computed or reasoned about by means of Probabilistic
Duration Calculus, PDC [5], a recent extension of Duration Calculus².

The DC formula for Des-1 belongs to a class for which the satisfaction
probability can be expressed explicitly in PDC by means of the initial probability
vector $p$ and the transition probability matrix $P$ of G [5]. These matrices
are well known from the theory of Markov chains. For the PA of the composite
Gas Burner, Figure 2, they are given by:


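$$p = (q_1, 0, q_2, 0) \quad\text{and}\quad
P = \begin{pmatrix}
p_{11}q_{11} & p_{12}q_{11} & p_{11}q_{12} & p_{12}q_{12} \\
p_{21}q_{11} & p_{22}q_{11} & p_{21}q_{12} & p_{22}q_{12} \\
q_{21} & 0 & q_{22} & 0 \\
p_{21}q_{21} & p_{22}q_{21} & p_{21}q_{22} & p_{22}q_{22}
\end{pmatrix}$$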


with  row-  and  column  ordering  according  to  the  chosen  state  numbering.  It  is 
easy  to  see,  that  for  p  as  well  as  for  each  row  of  P  the  sum  of  elements  is  1  (this 
is  a  well-formedness  condition  for  probability  matrices). 

With D denoting a DC formula of the class referred to above, the explicit
expression for $\mu_G(D)[t+1]$ is a scalar product of the form [5]:

$$\mu_G(D)[t+1] = p' \cdot (P')^t \cdot \mathbf{1}_c$$

where $p'$ and $P'$ are obtained from $p$ and $P$, respectively, by replacement of
certain entries (depending on D) by zeros, $(P')^t$ denotes the $t$'th power of $P'$
and $\mathbf{1}_c$ denotes a column vector in which all elements are 1.

For G representing the composite Gas Burner and with D given as the DC
formula for Des-1 above, $p' = p$ (i.e. no entry in $p$ needs to be zeroed) and
$P'$ is obtained from $P$ by changing the entries in $P$ with (row, column)
numbers (1,1), (1,3), (3,1) and (3,3) to zero. Accordingly:

$$\mu_G(\Box(\lceil Leak \rceil \Rightarrow \ell \le 1))[t+1] = p' \cdot (P')^t \cdot \mathbf{1}_c =
(q_1, 0, q_2, 0) \cdot
\begin{pmatrix}
0 & p_{12}q_{11} & 0 & p_{12}q_{12} \\
p_{21}q_{11} & p_{22}q_{11} & p_{21}q_{12} & p_{22}q_{12} \\
0 & 0 & 0 & 0 \\
p_{21}q_{21} & p_{22}q_{21} & p_{21}q_{22} & p_{22}q_{22}
\end{pmatrix}^{t} \cdot
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$

Informally  the  rules  for  obtaining  the  primed  matrices  from  the  unprimed 
ones  (i.e.  for  the  zeroing  of  entries)  are  as  follows: 

If D is violated by behaviours which have state $i$ as the initial state (the
state in the first time unit), then the $i$'th entry of $p$ should be zeroed in the
matrix expression for $\mu_G(D)[t+1]$. If D is violated by a transition from state $i$
to state $j$, then the entry at location $(i, j)$ in $P$ should be zeroed.

² The semantic model of PDC is the finite probability space $\langle V^t, \mu \rangle$ induced by G, where
$V^t$ is the set of behaviours (state sequences of G) of length $t$ and $\mu$ is the probability
measure which assigns a probability to each behaviour. This probability is the product of
the initial probability and the transition probabilities involved in the behaviour. $\mu_G(D)[t]$
is then defined as the sum of the behaviour probabilities for the subset of behaviours which
satisfy D over the first $t$ time units.

The  DC  formula  for  Des-1  places  no  restriction  on  the  choice  of  initial 
state,  and  accordingly  no  entry  of  p  needs  to  be  zeroed.  However,  the  formula 
is  violated  for  all  transitions  such  that  there  is  Leak  before  as  well  as  after 
the  transition.  This  is  because  such  transitions  imply  existence  of  Leak  states 
lasting for at least two time units, whereas the formula only tolerates Leak states
lasting for at most one time unit. As previously observed, Leak (= $\neg Flame$)
holds for the composite states 1 and 3 in Figure 2. This implies that the offending
transitions are those associated with entries (1,1), (1,3), (3,1) and (3,3) in $P$.
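The zeroing rule and the matrix expression can be checked numerically. The following sketch uses hypothetical probability values and exact rational arithmetic. It compares the matrix expression with a brute-force sum over all behaviours of length t + 1 that contain no Leak-to-Leak transition. The function names are illustrative and do not come from [5].

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical values for the symbolic probabilities of the Gas Burner.
p11, p12, p21, p22 = F(1, 10), F(9, 10), F(1, 100), F(99, 100)
q11, q12, q21, q22 = F(999, 1000), F(1, 1000), F(1, 2), F(1, 2)
q1, q2 = F(99, 100), F(1, 100)

p = [q1, F(0), q2, F(0)]
P = [[p11 * q11, p12 * q11, p11 * q12, p12 * q12],
     [p21 * q11, p22 * q11, p21 * q12, p22 * q12],
     [q21,       F(0),      q22,       F(0)],
     [p21 * q21, p22 * q21, p21 * q22, p22 * q22]]

LEAK = {0, 2}  # zero-based indices of composite states 1 and 3

# P': zero every entry whose transition keeps the system in Leak.
P_primed = [[F(0) if (i in LEAK and j in LEAK) else P[i][j] for j in range(4)]
            for i in range(4)]

def satisfaction(p_primed, matrix, t):
    """Evaluate p' . (P')^t . 1_c by multiplying the ones vector t times."""
    v = [F(1)] * len(p_primed)
    for _ in range(t):
        v = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
    return sum(a * b for a, b in zip(p_primed, v))

def brute_force(t):
    """Sum the probabilities of all behaviours of length t + 1 that satisfy Des-1."""
    total = F(0)
    for beh in product(range(4), repeat=t + 1):
        steps = list(zip(beh, beh[1:]))
        if any(a in LEAK and b in LEAK for a, b in steps):
            continue
        prob = p[beh[0]]
        for a, b in steps:
            prob *= P[a][b]
        total += prob
    return total
```

For small t the two computations agree exactly, which is the sense in which zeroing an entry removes the offending behaviours from the sum.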


5  Probabilistic  Refinement 


Let $G_1$ and $G_2$ be the closed PA's representing two designs and let D represent
a common requirement for these designs. Then $G_2$ is said to refine $G_1$ with
respect to D if, and only if:

$$\forall t_0 \ge 0 \;\bullet\; \mu_{G_2}(D)[t_0] \ge \mu_{G_1}(D)[t_0]$$

We  shall  now  examine  this  for  various  Gas  Burner  designs  and  for  Des-1. 

First we notice that if $t_0 = 0$, then Des-1 will be trivially satisfied for any
design G, i.e. $\mu_G(Des\text{-}1)[0] = 1$. The reason for this is that in the formula for
Des-1 the left side of the implication, i.e. $\lceil Leak \rceil$, is false for a point interval
(a leak state must last for at least one time unit).

For $t_0 > 0$ we make the substitution $t_0 = t + 1$, $t \ge 0$, and compute
$\mu_G(Des\text{-}1)[t+1]$ from the matrix expression $p' \cdot (P')^t \cdot \mathbf{1}_c$.

As previously explained, with Des-1 as the D formula, $p' = p$. This, in turn,
implies that the matrix expression will evaluate to 1 for $t = 0$. The reason for
this is that the sum of entries in $p$ is 1 and $(P')^0$ is the identity matrix. This
result reflects that the initial Leak state (caused by the necessary gas release
before  the  first  ignition)  lasts  for  (at  least)  one  time  unit  and  does  not  violate 
Des-1  during  the  first  time  unit. 

We will compare four designs of the Gas Burner:

-  Design 1 is a poor design. It has a Burner but no Detector (or the Detector
is permanently non-active). The PA $G_1$ for this design is shown in Fig. 3a.
The probability matrices are

$$p = (1, 0), \quad p' = p, \quad
P = \begin{pmatrix} 1 & 0 \\ p_{21} & p_{22} \end{pmatrix}, \quad
P' = \begin{pmatrix} 0 & 0 \\ p_{21} & p_{22} \end{pmatrix}$$

It is easy to prove (and intuitively clear) that $\mu_{G_1}(Des\text{-}1)[t+1] = 0$ for
$t > 0$. This reflects that the Gas Burner will never be able to establish
flame, i.e. to eliminate the initial Leak caused by gas release.

Fig. 3. Various designs of the Gas Burner: a) Design 1, b) Design 3, c) Design 4. (Design 2 is defined by Figure 2.)

-  Design 2 is the composite Gas Burner (Burner || Detector) modelled
and analysed in the previous sections. Intuitively $G_2$ (defined by Figure 2)
refines $G_1$ because it uses a detector which is not permanently failed. This
can be validated by computation of the matrix expressions for a suitable
range of $t$'s.

-  Design 3 is the composite Gas Burner with a permanently active Detector.
Leak is detected immediately, but ignition may still fail with probability
$p_{11}$ within one time unit. The PA $G_3$ for this design is shown in Fig. 3b.
Since Act is always true, there are only two states, $\neg Flame$ and $Flame$, to
consider. The probability matrices are

$$p = (1, 0), \quad p' = p, \quad
P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}, \quad
P' = \begin{pmatrix} 0 & p_{12} \\ p_{21} & p_{22} \end{pmatrix}$$

Intuitively $G_3$ refines $G_2$ because it uses a permanently active detector.
This can also be validated by computation of the $\mu$'s over a suitable time
range.

-  Design 4 is the ideal composite Gas Burner. The Detector is perfect and
the ignition always succeeds within one time unit. The PA $G_4$ for this
design is shown in Fig. 3c.

The probability matrices are

$$p = (1, 0), \quad p' = p, \quad P' = P$$

From a theorem in [5] it follows that with $p' = p$ and $P' = P$ the
matrix expression $p' \cdot (P')^t \cdot \mathbf{1}_c$ evaluates to 1 for all values of $t$. Therefore
$\mu_{G_4}(Des\text{-}1)[t+1] = 1$ independent of $t$, and accordingly $G_4$ refines all
other Gas Burner designs with respect to Des-1 (but of course Design 4 is
not implementable).


Verification is stronger than validation. However, proof of probabilistic refinement
is a difficult area, and so far we can only offer the following theorem
applicable under rather special conditions [6].

For D-formulas such that $\mu_G(D)[t+1] = p' \cdot (P')^t \cdot \mathbf{1}_c$ and for two designs
$G_A$ and $G_B$ with the same number of states, if all elements in $p'_B$ are greater
than or equal to the corresponding elements in $p'_A$, and the same property holds
for $P'_B$ and $P'_A$, then $G_B$ refines $G_A$ with respect to D.

This theorem is applicable to Designs 1, 3 and 4 and proves that Design 4
refines Design 3 which in turn refines Design 1.
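The elementwise condition of the theorem, and the weaker numerical validation mentioned for the other designs, can be sketched as follows. The probability values are hypothetical. The matrix used for Design 4 is an assumption derived from the statement that ignition always succeeds within one time unit (so the first row is (0, 1)). The function names are illustrative.

```python
from fractions import Fraction as F

def satisfaction(p_primed, matrix, t):
    """Evaluate p' . (P')^t . 1_c."""
    v = [F(1)] * len(p_primed)
    for _ in range(t):
        v = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
    return sum(a * b for a, b in zip(p_primed, v))

def dominates(p_b, P_b, p_a, P_a):
    """Sufficient condition of the theorem: every primed entry of B is >= that of A."""
    vector_ok = all(b >= a for b, a in zip(p_b, p_a))
    matrix_ok = all(b >= a for rb, ra in zip(P_b, P_a) for b, a in zip(rb, ra))
    return vector_ok and matrix_ok

# Hypothetical Burner parameters shared by the designs.
p12, p21 = F(9, 10), F(1, 100)
p22 = 1 - p21

p_init = [F(1), F(0)]                  # p' = p for Des-1 in all three designs
P1 = [[F(0), F(0)], [p21, p22]]        # Design 1: P' with entry (1,1) zeroed
P3 = [[F(0), p12], [p21, p22]]         # Design 3: P' with entry (1,1) zeroed
P4 = [[F(0), F(1)], [p21, p22]]        # Design 4: assumed P' = P, ignition certain

refines_3_over_1 = dominates(p_init, P3, p_init, P1)
refines_4_over_3 = dominates(p_init, P4, p_init, P3)
```

Both flags are true for these values, and evaluating `satisfaction` over a range of t gives the same ordering numerically. For Design 2 the state spaces differ in size, so only the numerical comparison applies.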

6  Conclusion 

We  have  presented  new  results  concerning  composition,  analysis  and  refinement 
of  probabilistic  real  time  systems.  The  technique  needs  further  consolidation 
with regard to tools and theorems for validation and verification, and its practical
applicability to realistic dependability problems remains to be tested.

1. Introduction

The overall goal of a clinical laboratory is to analyse samples of blood and other
bodily fluids received from a patient, and to return the correct results to the
patient's doctor within a suitable period. Automated analysers are used in most
laboratories to analyse the various samples. Computer systems (LIMS¹) are widely
used in laboratories to control, support and monitor the work done in the
laboratory, keeping pace with the increased analytical capability provided by these
analysers. In particular, a LIMS is typically used to control and monitor (at least):

•  the  working  of  the  analysers,  deciding  what  tests  need  to  be  done  for 
each  sample  by  each  analyser, 

•  the  collating  of  requests  and  results,  and 

•  the  printing  of  the  results. 

The results of an analysis will directly influence the treatment of a patient -
treatment that can have potentially life-threatening consequences. For example, in
some types of suspected heart attacks, treatment is largely based on the results of
the analysis. The patient can die if the wrong treatment is administered. In a recent
case in the United Kingdom, a bank's computer system sent payments to the wrong
accounts. Consider what might happen (in the equivalent scenario) if the wrong
results were sent to a patient by the LIMS.

^1 Laboratory Information Management System

Consequently, although a LIMS is fundamentally an information system, it
must be classed as a safety critical system and developed as such. In the main,
however, these safety implications have not been considered in the development of
LIMS's. Furthermore, in contrast to other disciplines, little effort has been spent on
the standardisation and classification of the safety aspects of using computer-based
systems for medical care.

These problems will become more acute when LIMS's are linked to general
hospital information systems. In the long-term, this will make the laboratory results
accessible from wards in the hospital and from local GP surgeries. The aim of this
is to improve patient care. However, as more and more people have (instant) access
to the results, it is essential to ensure the integrity and correctness of any results
that are accessible from outside the laboratory.

This paper discusses the re-development of the LIMS at the WMH,^2
undertaken as part of the MORSE^3 project and carried out jointly by WMH and
Lloyd's Register. The MORSE project uses a multi-disciplinary approach to the
development of safety critical systems, based on those proposed in the draft
Ministry of Defence standards 00-55 [1] and 00-56 [2]. This approach combines the
use of safety analysis with the use of formal development methods. This paper
describes the overall approach, and concentrates on the application of RAISE [3], a
particular formal development method, in the re-development of the LIMS. As
space will not allow a full description of the work, the use of RAISE will be
illustrated by describing the specification of certain key areas.

2. Simplified LIMS

A simplified layout of a LIMS is shown in Figure 1. This shows a complete
analytical loop, with the LIMS controlling a single analyser. It indicates how:

• test requests are entered using a terminal,

• the analyser receives the samples and requests, analyses the samples
and returns the appropriate results, and

• the results are printed before being dispatched to the patient.

This is only one of many possible layouts, and in practice the LIMS would be
controlling several analysers. However, by using a single analyser, the description
of the LIMS is simplified as all test requests will go to the one analyser. As a
further simplification, the validation and archiving of results is not shown here.


^2 West Middlesex University Hospital

^3 Method for Object Re-use in Safety-critical Environments

3. Approach used to analyse the LIMS

The data handling in the laboratory as a whole (including the LIMS) was described
and modelled before starting the safety analysis. The safety analysis produced an
assessment of the safety of the data handling, and an associated list of hazards. The
safety analysis of the LIMS is described in an accompanying paper and in [4]. This
description of the LIMS and the list of hazards was used as the basis for the formal
specification and re-development of the LIMS.

The entire LIMS was formally specified using RAISE. Safety properties that
remove or at least constrain the hazards identified by the safety analysis were also
described and captured in the specification. These properties can be thought of as
forming a safety case for the LIMS. As RAISE is mathematically based, these
safety properties are described as constraints on the behaviour of the LIMS.

Several components of the system, identified in the specification, have been
selected for further development. The selection, based on the safety assessment of
the laboratory, will redevelop a complete analytical loop of the LIMS - from the
input of requests to the printing of results. The development will be (rigorously)
verified to ensure that the safety properties identified in the specification are
maintained. It is intended that another safety analysis of the LIMS be carried out
after the development to assess whether the re-development of the LIMS has
improved the safety of the LIMS.

3.1. Additional safety considerations

Ensuring that the LIMS functions correctly and preserves the integrity of the data
entrusted to it is only part of what is done to ensure that the laboratory will meet its
goal. The reliability of the LIMS and of the laboratory as a whole must be
considered, ensuring that results will be returned to patients in time. This will
include the backing up of data, the presence of standby machines, contingency
plans for staff illness, etc.

The accuracy of the chemical analysis must also be considered. In the UK, this
is independently assessed by a central body. In this paper, however, only the LIMS
and its workings will be considered.

4.  Some  results  of  the  safety  analysis 

In the safety analysis, several key areas were identified. Two of these will be
discussed here: the identification of patients from information on the request forms,
and the identification and collation of requests, samples and results.

4.1. Identification of patients

A database of all the patients who have been to the hospital is kept on a central
patient administration system. Each patient in this database has a unique hospital
number. If a request for a patient that has been to the hospital before, or is currently
in the hospital, is received, the patient's details are retrieved from the database. If
the patient's details cannot be found - if the patient has not been to the hospital, or
no match can be made - a new number is assigned to the patient.

It is extremely important that this matching of details is accurate. If the wrong
match is made, wrong details will be used. Consequently, the criteria for matching
details are (and must be) strict. If a satisfactory match cannot be made, the patient
is treated as new, rather than allowing wrong details to be used.

4.2.  Identification  and  collation  of  requests,  samples  and  results 

Samples and requests are assigned a unique label or identifier when they are
received at the laboratory. This label makes it possible to distinguish samples from
different patients, and is used to track requests and samples in the laboratory and
for matching up results with the appropriate requests. This labelling is essential to
the working of the laboratory.

5.  Specification  of  the  system 

The main activities carried out in a laboratory are described below: receiving
requests, analysing samples, and collating and printing results. A simple model of
the LIMS is defined, describing how the LIMS supports these activities. Using this
model of a LIMS, we will introduce some constraints on the system. These
constraints will capture some of the safety properties of a LIMS.

This model will contain a simple information model, detailing the minimum
amount of information that is needed for a LIMS to work effectively. Considerable
detail is omitted by using under-specification, while still defining the essential
properties of the LIMS.

In this paper, the model of the LIMS will be sketched out using RAISE,
defining only the signatures of functions for the most part. The RAISE is not
complete, and not all the modules have been included.

Several activities carried out in the laboratory have not been described here. In
particular, the validation of the results has been omitted. During validation, the
accuracy, completeness (no results missing), and internal consistency of the results
is checked. The archiving of results has also been omitted. This is in no way
intended to imply that these activities are not essential in the laboratory.

5.1. Receiving requests and samples

When samples and requests are received at the laboratory, they are assigned a
unique label. A request will contain at least: the patient details, and a list of the
tests to be carried out on the sample, as described in DATA_MODEL0.

scheme DATA_MODEL0 = class
  type
    PatientDetails, Sample, Testid,
    Request ==
      _(patient_details : PatientDetails, test_requests : Testid*)
end

Normally, requests will also contain: the name of the referring doctor, the
location of the doctor, the time and date of sampling, any relevant clinical
information, the type of the specimen, and any special information or precaution
relevant to specimen collection or handling.

We will also assume that each request is related to a single sample. In practice
this is not the case and a request can relate to several different samples.

5.1.1. Identifying the patient

When a request is received, it will normally contain the patient's surname, initials,
date of birth and possibly a hospital number. For this paper, a match can be made,
match_found, if the patient's hospital number and surname match, or if the
surname, initials and date of birth match.

scheme DATA_MODEL1 = extend DATA_MODEL0 with class
  type HospitalNum, Date, Name, Initials
  value no_num : HospitalNum
  value patient_id : Request → HospitalNum
  value
    surname : PatientDetails → Name,
    initials : PatientDetails → Initials,
    dob : PatientDetails → Date
end

no_num is used to denote no hospital number on the request form. The patient
database is represented as a mapping from HospitalNum to PatientDetails.

scheme LIMS1(D : DATA_MODEL1) = class
  variable patient_db : D.HospitalNum -m-> D.PatientDetails
  value
    /* Match using hospital number. */
    match_id : D.Request → read any Bool
    match_id(r) ≡ let i = D.patient_id(r) in
        let req_details = D.patient_details(r), db_details = patient_db(i) in
          D.surname(req_details) = D.surname(db_details)
        end end
      pre D.patient_id(r) ∈ dom patient_db,

    /* Match using patient details. */
    match_details : D.Request → read any Bool
    match_details(r) ≡ ( ∃ i : D.HospitalNum • i ∈ dom patient_db ∧
        let req_details = D.patient_details(r), db_details = patient_db(i) in
          D.surname(req_details) = D.surname(db_details) ∧
          D.initials(req_details) = D.initials(db_details) ∧
          D.dob(db_details) = D.dob(req_details) end )
      pre D.patient_id(r) = D.no_num ∨ D.patient_id(r) ∉ dom patient_db
  value
    match_found : D.Request → read any Bool
    match_found(r) ≡
      if D.patient_id(r) ∈ dom patient_db then match_id(r)
      else match_details(r) end

  axiom [dbase_consistency1] D.no_num ∉ dom patient_db

end

For the database to be consistent, no_num cannot be used to identify a patient's
details.
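
The matching rule can also be read operationally. The Python sketch below is
only an illustration of the rule stated above, not part of the RAISE
development: the class names, the use of None for no_num, and the string
representation of dates are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

NO_NUM = None  # stands in for no_num: no hospital number on the request form


@dataclass(frozen=True)
class PatientDetails:
    surname: str
    initials: str
    dob: str  # date of birth, kept as a string for simplicity


@dataclass(frozen=True)
class Request:
    patient_id: Optional[str]
    details: PatientDetails


def match_id(db: Dict[str, PatientDetails], r: Request) -> bool:
    # Precondition of the specification: the number is in the database.
    assert r.patient_id in db
    return db[r.patient_id].surname == r.details.surname


def match_details(db: Dict[str, PatientDetails], r: Request) -> bool:
    # Precondition: no number on the form, or the number is unknown.
    assert r.patient_id is NO_NUM or r.patient_id not in db
    return any(
        d.surname == r.details.surname
        and d.initials == r.details.initials
        and d.dob == r.details.dob
        for d in db.values()
    )


def match_found(db: Dict[str, PatientDetails], r: Request) -> bool:
    if r.patient_id in db:
        return match_id(db, r)
    return match_details(db, r)
```

As in the axiom, the sketch relies on NO_NUM never being used as a key of the
patient database.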

5.1.2. Identifying the requests and the samples

For each request and accompanying sample that is received, a new label is created
and assigned to that request. The details of the request with this label are entered
into the LIMS, enter_test. As the collation of results with requests depends on the
label being unique, the LIMS must guarantee that different requests with the same
label cannot be entered.

theory TH_LIMS2 : axiom

  /* Assuming no duplicates, enter_test will not create a labelled */
  /* request which has the same label as another request. */
  in class object D : DATA_MODEL2, L : LIMS2(D)
    value /* Check that no two requests have the same label. */
      no_duplicates : Unit → read any Bool
  end ⊢

    ∀ l : D.Label, r : D.Request • L.enter_test(l, r) ; no_duplicates() ≡
      L.enter_test(l, r) ; true pre no_duplicates() ∧ L.label_used(l)

end
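
The guarantee stated in the text (different requests with the same label
cannot be entered) can be pictured as a run-time check. The following Python
sketch is an illustration under that reading only; it does not reproduce the
RAISE precondition, and the names RequestStore and DuplicateLabel are
hypothetical.

```python
class DuplicateLabel(Exception):
    """Raised when a label is already attached to an entered request."""


class RequestStore:
    def __init__(self):
        self.entries = []  # list of (label, request) pairs

    def no_duplicates(self):
        # True when no two entered requests share a label.
        labels = [label for label, _ in self.entries]
        return len(labels) == len(set(labels))

    def enter_test(self, label, request):
        # Refuse the entry instead of breaking label uniqueness.
        if any(existing == label for existing, _ in self.entries):
            raise DuplicateLabel(label)
        self.entries.append((label, request))
```

If no_duplicates() holds before a call to enter_test, it also holds afterwards,
whether the call succeeds or is rejected.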

5.2.  Analysing  the  samples 

After the samples are labelled, they are prepared for analysis and are passed to the
analyser. For each sample received by the analyser, it must:

• identify the sample, using the label on the sample,

• get the list of tests required for the sample,

• perform the necessary tests, and

• return the results of the tests.

The result of a test must contain details of what test was done. Furthermore, each
result must also be labelled with the same label as the request so that the two can be
collated.

scheme DATA_MODEL3 = extend DATA_MODEL2 with class type
  AnalyserRequest == _(label : Label, test_id : Testid),
  TestResult == _(test_id : Testid),
  AnalyserResult == _(label : Label, test_result : TestResult)
end


5.3. Collating and printing the results

After the samples are analysed, the results must be collated with the tests to ensure
that: all the tests have been performed, and that only the tests wanted have been
done. The results of the tests are then collated with the patient details into a report.
This report is printed and is sent to the patient's doctor. This report must contain
the same label as the request to check that the request has been complied with.

It can be shown in this model that once a label has been assigned to a request,
it is never changed. Furthermore, the same label is assigned to each test and each
result associated with that request.

scheme DATA_MODEL4 = extend DATA_MODEL3 with class type
  Report == _(label : Label, patient_details : PatientDetails,
              test_results : TestResult*) end

In this model, it is assumed that reports are only printed when all the tests have
been completed. In practice this is not the case as some results may be needed
urgently and these will be sent to the patient's doctor as soon as they are completed.
However, it can be shown for this model (proved formally if necessary) that reports
are only printed after all the tests have been completed.

theory TH_LIMS4_1 : axiom
  in class object D : DATA_MODEL4, L : LIMS4(D) end ⊢
    ∀ l : D.Label • L.report_is_printed(l) ⇒ L.tests_completed(l) end

A further safety property of the LIMS is that spurious reports are not generated -
only reports for requests received by the laboratory will be printed.

theory TH_LIMS4_2 : axiom
  in class object D : DATA_MODEL4, L : LIMS4(D) end ⊢
    { D.label(x) | x : D.Report • x ∈ L.printed } ⊆
    { D.label(y) | y : D.LabelledRequest • y ∈ L.requests } end
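
To make the two report properties concrete, the Python sketch below models a
small report log and checks both properties as an invariant. It is an
illustration only: the class ReportLog, its method names and the test
identifiers are assumptions, not part of the LIMS specification.

```python
class ReportLog:
    def __init__(self):
        self.requested = {}   # label -> set of test ids requested
        self.results = {}     # label -> set of test ids with a result
        self.printed = set()  # labels of printed reports

    def enter_request(self, label, test_ids):
        self.requested[label] = set(test_ids)
        self.results[label] = set()

    def record_result(self, label, test_id):
        # Results for unknown labels or unrequested tests are ignored.
        if test_id in self.requested.get(label, set()):
            self.results[label].add(test_id)

    def tests_completed(self, label):
        return (label in self.requested
                and self.results[label] == self.requested[label])

    def print_report(self, label):
        # A report is printed only once all requested tests have results.
        if not self.tests_completed(label):
            return False
        self.printed.add(label)
        return True

    def invariants_hold(self):
        after_completion = all(self.tests_completed(l) for l in self.printed)
        no_spurious = self.printed <= set(self.requested)
        return after_completion and no_spurious
```

Because print_report is the only operation that adds to the printed set, and
it is guarded by tests_completed, both properties hold after any sequence of
calls.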

6. Concluding comments

This paper describes how the multi-disciplinary approach advocated by the
MORSE project has been applied in re-developing the LIMS at the WMH. In
particular, the use of the formal method RAISE is demonstrated by defining a
model of a simplified LIMS. In this model, both the information and the
functionality required by the LIMS to support the work done in the laboratory are
defined.

By using RAISE, the properties of the LIMS can be investigated at the
specification stage of the re-development (as shown here), rather than at later stages
of the development. Some safety properties of the LIMS are also defined, showing
how hazards can be removed (or at least reduced). Furthermore, by using RAISE,
one can prove that these safety properties are maintained throughout the
re-development of the LIMS - from its specification to its implementation. This
ensures that hazards removed in the specification of the LIMS are removed from
the implementation of the LIMS. The effectiveness of this approach will be
assessed when the second safety analysis is carried out, after the re-development is
completed.

Abstract

The task of safeguarding systems is to bring processes from dangerous
into safe states. A special class of safeguarding systems are
emergency shut-down systems (ESD), which, until now, have only been
implemented in inherently fail safe hard wired forms. Despite their
high reliability, there is an urgent industrial need to replace them
by more flexible systems. Therefore, a low complexity, fault detecting
computer architecture was designed, on which a programmable
logic controller for ESD applications can be based. Functional logic
diagrams, the traditional graphical specification tool of ESDs, are
directly supported by the architecture as an appropriate user oriented
programming paradigm. Thus, by design, there is no semantic gap
between the programming and machine execution levels, enabling
the safety licensing of application software by formal methods or
back translation. The concept was proven feasible by a working
demonstration model.


1  Introduction 

Many technical systems have the potential of disastrous effects on, for instance,
the environment, equipment, employees, or the general public in case of
malfunctions. An important objective of the design, construction, and
commissioning of such systems is, therefore, to minimise the chances that hazards
occur. One possibility to achieve this goal is the installation of a system whose
only function is to supervise a process and to take appropriate action if
anything in the process turns dangerous. So, to prevent hazards, many processes
are guarded by these so called safeguarding systems. A special kind of these
systems are Emergency Shut-Down systems (ESD), which are defined as:

A system that monitors a process, and only acts - i.e., guides the
process to a static safe state (generally, a process shut-down) - if
the safety of either human beings, the environment, or investments
is at stake.

The mentioned monitoring consists of observing whether certain physical
quantities such as temperatures or pressures stay within given bounds and of
supervising Boolean quantities for value changes. Typical ESD actions are opening or
closing valves, operating switches etc. Structurally, ESDs are functions
composed of Boolean operators and delays. The latter are required, because in
start-up and shut-down sequences often some monitoring or actions need to be
delayed. Originally, safeguarding systems were constructed pneumatically and
later, e.g., in railway signalling, with electromagnetic relays. Nowadays, most
systems installed are based on integrated electronics and there is a tendency
to use microcomputers.

The current (electrical) systems used for emergency shut-down purposes are
hard wired and each family makes use of a certain principle of inherently fail
safe logic. The functionality of an ESD system is directly implemented in
hardware out of building blocks for the Boolean operators and delays by
interconnecting them with wires. These building blocks are fail safe, i.e., any
internal failure causes the outputs to assume the logically false state. Unless
implemented wrongly, this results in a logically false system output, which in
turn causes a shut-down. Thus, any failure of the ESD system itself will lead
to a safe state of the process (generally a process shut-down). This
technology, used successfully for decades now, has some very strong advantages. The
simplicity of the design makes the hardware very reliable. The one-to-one
mapping of the client's specification expressed as functional logic diagrams (FLD) to
hardware modules renders implementation mistakes virtually impossible. The
"programming" consists of connecting basic modules by means of wires,
stressing the static nature of such systems. Finally, the fail safe character of hard
wired systems is a very strong advantage. But there are also disadvantages
that gave rise to the work reported here.

Economical considerations impose stringent boundary conditions on the
development and utilisation of technical systems. This holds for safety related
systems as well. Since manpower is becoming increasingly expensive, safety
related systems also need to be highly flexible, so that they can be adjusted
to changing requirements at low cost and within short times. In other words,
safety related systems such as ESDs must be program controlled in order to
relieve hard wired logic from taking care of safety functions in industrial
processes. Owing to their simplicity, the most promising alternative to hard
wired logic in ESD systems are programmable logic controllers (PLC), which
can provide the same functionality. However, although a reasonable hardware
reliability can be obtained by redundancy, constructing dependable software
constitutes a serious, still unsolved problem.


There is already a number of established methods and guidelines, which have
proven their usefulness for the development of highly dependable software
employed for the control of safety critical technical processes. Prior to its
application, such software is further subjected to appropriate measures for its
verification and validation. However, according to the present state of the
art, these measures cannot guarantee the correctness of larger programs with
mathematical rigour. Prevailing legal requirements demand that object code
must be considered for the correctness proofs of software, since compilers are
themselves software systems far too complex for their correct operation
to be verified. Depending on national legislation and practice, the licensing
authorities are still very reluctant or even refuse to approve safety related
systems, whose behaviour is exclusively program controlled.

In order to provide a remedy for this unsatisfactory situation, it was the
purpose of the work reported here to develop a special (and necessarily simple)
computer system in the form of a programmable logic controller, which can
carry out safety related functions as required in emergency shut-down systems.
The leading idea followed throughout this design was to combine already
existing software engineering and verification methods with novel architectural
support. Thus, the semantic gap between software requirements and hardware
capabilities is closed, removing the need for compilers and operating systems
that cannot be safety licensed. By keeping the complexity of each component in the
system as low as possible, the safety licensing of the hardware in combination
with application software is enabled on the basis of well established and proven
techniques.

2  The  Software  Building  Blocks 

All emergency shut-down systems can be constructed from a set of function
modules containing just four elements, viz., the three Boolean operators And,
Or, Not and a timer. For reasons of simplicity we restrict the number of inputs
for both And and Or to two. The functionality is not affected, since any multiple
input function can be described with a finite number of the two input gates.
It is also sufficient to use only one type of timer. All other forms of timers
used in hard wired logic can be implemented by, if need be, adding inverters.
The timer has one Boolean input, I, one Boolean output, O, an adjustable
delay time, t, and an internal state, d, with 0 ≤ d ≤ t. Its functionality can be
informally described as follows:

• Initially, the input is false, the output is false, and the internal counter,
d, has assumed its maximum (as set), so d = t.

• As the input becomes true, the output remains false and the counter, d,
decreases, i.e., the timer starts counting down.

• As soon as the counter becomes zero and the input is still true, the output
turns true.

• If the input is false, after having been true for less than the preset delay
time t, then the timer is reset. That is, the output becomes false and the
delay time assumes its initial (maximum) value.

• If the input becomes false after d is 0 and, thus, the output has become
true, also a reset operation is performed.

Counter      Input    Output
d = t        true     false
0 < d < t    true     false
d = 0        true     true
d = t        false    false
0 < d < t    false    false
d = 0        false    false

Table 1: The timer output as function of input and counter value

We observe that there are two values that may have to be changed. First,
obviously, the logical output could change as a function of the input and the
internal state. Secondly, the internal state may need updating, depending on
both the logical input and the internal state.

Although the timer has numerous internal states, three interesting
ones can be extracted, viz., d = t, d = 0 and 0 < d < t. They are displayed
in Table 1.

The functionality of the timer can be represented by a simple Boolean
expression for its output:

$O = (d = 0) \wedge I$
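
The informal description of the timer can be turned into a small executable
model. The Python sketch below is an illustration, not the verified
implementation referred to in the text: it assumes a discrete model in which
step is called once per cycle and the counter is decremented once per cycle
while the input is true.

```python
class Timer:
    def __init__(self, delay):
        self.delay = delay  # preset delay time t, counted in cycles
        self.d = delay      # internal counter, initially at its maximum

    def step(self, inp):
        """Advance one cycle with Boolean input `inp` and return the output."""
        if inp:
            if self.d > 0:
                self.d -= 1          # count down while the input is true
        else:
            self.d = self.delay      # any false input resets the timer
        # Output expression: O = (d = 0) and I
        return self.d == 0 and inp


timer = Timer(3)
outputs = [timer.step(x) for x in [True, True, True, True, False, True]]
print(outputs)
```

In this model the output turns true in the third consecutive cycle with a true
input and falls back to false as soon as the input becomes false.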

What the module still lacks is a realisation of time. Hence, we define time,
with 0 ≤ time < ∞. In a system, time can be implemented in either hardware
or software. For accuracy reasons, we have chosen a hardware solution: time
is implemented in the form of a counter triggered by a quartz stabilised time base.

An implementation of the four function modules discussed above has been
proven correct using predicate calculus [1]. This was trivial in the case of the
Boolean functions. The correctness proof of the timer was straightforward,
but took a few pages. The interested reader is referred to [2], because size
restrictions prohibit including the proof in this article.

3 The Software Engineering Paradigm

The analysis of functional logic diagrams suggests introducing a new
programming paradigm, viz., to compose software out of high level user oriented
building blocks instead of low level machine oriented ones. Whereas a single
machine instruction taken out of a program context does not reveal its purpose,
the occurrence of a certain function module instance usually already gives a
clue about the problem, its solution, and the module's role in it.

The development of ESD software is carried out by process engineers in the
traditional way of drawing FLDs. The latter describe the mapping from Boolean
inputs to Boolean outputs as functions of time such as, e.g.,

if  a  pressure  is  too  high  then  a  valve  should  be  opened  and  an 
indicator  should  light  up  after  5  seconds. 

In  Figure  1  an  example  of  a  FLD  is  given.  The  FLD  describing  the  functionality 
of  an  average  ESD  system  contains  thousands  of  blocks,  laid  out  over  many 
drawing  sheets. 


Figure  1:  An  example  of  a  FLD  (with  dyadic  Boolean  operators  only) 

This specification level programming method consists of graphically
interconnecting instances of the above mentioned four basic function modules with each
other by lines, i.e., single basic functions are invoked one after the other and, in
the course of this, they pass parameters. The interconnections between
function blocks have to meet just one restriction: each input must be connected to
exactly one output. Besides the provision of constants as external input
parameters, the basic functions' instances and the parameter flows between them
are the only language elements required by this programming paradigm.

A compiler transforms the graphically represented program logic into object
code. Owing to the simple structure this logic is able to assume, the
generated programs contain no other features than sequences of procedure calls
and some internal moves of data. The verification of the compiler
transforming the graphical software representation into object code is still impossible,
but also not necessary, because for FLD software only the module
interconnections need to be verified. As outlined below, for this task the architecturally
supported method of back translation is employed.


4  The  Architectural  Concept 

In order to make the implemented software and its execution process easy to
comprehend, we design an architecture for an ESD oriented programmable logic
controller with, conceptually, two processors:

•  a  control  flow  processor  (master)  and 

•  a  basic  function  block  processor  (slave). 

Thus, we achieve a clear and physical separation of concerns: execution of the
basic function modules in the slave processor and all other tasks, i.e.,
execution control, sequential function chart processing, and function module
invocation, assigned to the master. This concept implies that the application
code is restricted to the control flow processor, on which the project specific
safety licensing can concentrate. Special architectural support for the cyclic
operating mode of programmable logic controllers is implemented in the master
processor. To enable the detection of faults in the hardware, a dual channel
configuration has been chosen, which supports diversity in the form of different
master processors and different slave processors.

At least one of the master processors should have the simplest organisation
possible for the considered application, requiring only two instructions, one of
which is MOVE. The other one, STEP, implements special architectural support for
the cyclic operating mode of PLCs. Since only one step is active at any given
time, a memory protection mechanism prevents erroneous access to the program
code of the inactive steps. The STEP instruction is the only means to perform a
branch. It only allows a return to the initial address of the active step's
program code if the corresponding transition condition is not fulfilled.

The capabilities of the slave need to be somewhat more complex and are implied
by the operations of the four basic function modules. The objective of building
a PLC for ESD systems suggested employing the VIPER [3] chip in the slave,
because it is the only available microprocessor whose design has been formally
proven correct. The slave processor performs all data manipulations and takes
care of the communication with the environment. It has no program RAM, but only
executes the basic function modules whose code is contained in firmware ROMs.

To recognise hardware faults, all processing is simultaneously performed on two
master/slave pairs. A number of comparators checking the outputs from the master
processors before they reach the slaves, and vice versa, completes a fault
detecting two-channel configuration. The master and slave processors communicate
with each other through two FIFO queues. They execute programs in co-ordination
with each other as follows. The master processor lets the slave execute a
function block by sending the latter's identification and the corresponding
parameters and, if need be, also the block's internal state values via one of
the FIFO queues to the slave processor. There the object program implementing
the function block is executed, and the generated results and new internal
states are sent to the master processor through the other FIFO queue. The
elaboration of the function block ends with fetching these data from the output
FIFO queue and storing them in the master's memory. To avoid faults during
operation, the function modules' object code is put in the slave's read-only
program memory after the correctness of the code has been established. The
master/slave configuration has been chosen to physically separate two system
parts from one another: one whose software only needs to be verified once, and
the other one performing the application specific part of the software. Needless
to say, the latter requires individual safety licensing. This concept implies
that FLDs are solely mapped onto the control flow processor, to which project
specific safety licensing can be restricted. Figure 2 gives a conceptual diagram
of the master/slave PLC architecture.

Figure 2: Configuration of a PLC with master/slave processors
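
The hand-off between master and slave described above can be sketched in a few lines of Python. This is only a single-channel illustration under stated assumptions, not the authors' implementation: the block names AND and OR, the tuple layout of the messages, and the dictionary standing in for the firmware ROM are all hypothetical, and the comparators of the second channel are left out.

```python
from collections import deque

# Hypothetical firmware "ROM": block id -> function(params, state) -> (results, new_state).
FIRMWARE = {
    "AND": lambda p, s: ((p[0] and p[1],), s),
    "OR": lambda p, s: ((p[0] or p[1],), s),
}

to_slave = deque()   # FIFO queue master -> slave
to_master = deque()  # FIFO queue slave -> master


def slave_step():
    """Slave: take one request, run the block from firmware, return the results."""
    block_id, params, state = to_slave.popleft()
    to_master.append(FIRMWARE[block_id](params, state))


def master_invoke(block_id, params, state=None):
    """Master: send id, parameters and state, then fetch results and new state."""
    to_slave.append((block_id, params, state))
    slave_step()  # in hardware the slave processor runs on its own
    return to_master.popleft()


results, _ = master_invoke("AND", (True, False))
print(results)  # (False,)
```

The master never executes block logic itself; it only moves data into and out of the queues, which is why two instructions can suffice on that side.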

In available PLCs, the execution time for a step generally varies from one cycle
to the next depending upon the program logic performed and the external
conditions evaluated each time. Therefore, the measurement of external signals
and the output of values to the process is usually not carried out at
equidistantly spaced points in time, although this may be intended in the
control software. To achieve full determinism of the time behaviour of
programmable logic controllers, a basic cycle is introduced. The length of the
cycle is selected so as to accommodate the execution of the most time-consuming
step occurring in an application. To supervise that the execution time of a step
does not exceed this cycle period, the processor awaits, at the end of the
step's program processing and after the evaluation of the corresponding
transition condition(s), the occurrence of a clock signal, which marks the
beginning of the next cycle. An overload situation or a run time error,
respectively, is encountered when the clock signal interrupts an active
application program. In this case suitable error handling has to be carried out.
Although the introduction of the basic cycle exactly determines a priori the
cyclic execution of the single steps, the processing instants of the various
operations within a cycle may still vary and, thus, remain undetermined. Since a
precisely predictable timing behaviour is only important for input and output
operations, temporal predictability is achieved as follows. All inputs occurring
in a step are performed en bloc at the beginning of the cycle and the data thus
obtained are buffered until they are processed. Likewise, all output data are
first buffered and finally sent out together at the end of the cycle.
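
A minimal Python sketch of one basic cycle may make the buffering scheme concrete. It rests on assumptions that are not in the paper: the function and signal names are invented, the clock is passed in as a callable, and the wait for the clock signal is reduced to a comparison of elapsed time against the cycle length.

```python
def run_cycle(read_inputs, step_logic, write_outputs, clock, cycle_length):
    """Run one basic cycle; return False if the step overran the cycle."""
    start = clock()
    inputs = read_inputs()          # all inputs are sampled en bloc at cycle start
    outputs = step_logic(inputs)    # the step works only on the buffered inputs
    if clock() - start > cycle_length:
        return False                # overload: the clock tick would interrupt the step
    write_outputs(outputs)          # buffered outputs are released together
    return True


# Usage with a fake clock (seconds) so that the example is deterministic.
ticks = iter([0.000, 0.004])
sent = []
ok = run_cycle(
    read_inputs=lambda: {"pressure_high": True},
    step_logic=lambda i: {"open_valve": i["pressure_high"]},
    write_outputs=sent.append,
    clock=lambda: next(ticks),
    cycle_length=0.010,
)
print(ok, sent)  # True [{'open_valve': True}]
```

Because the step logic sees only the buffered snapshot, variations in its internal execution order cannot shift the instants at which the process is read or written.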


5  Safety  Licensing 

With  the  implementations  of  all  four  basic  function  blocks  employed  in  FLDs 
having  been  proven  correct  and,  as  parts  of  the  architecture,  being  invisible 
from  the  application  programming  point  of  view,  for  any  new  ESD  project 
only  the  proper  mapping  of  a  particular  interconnection  pattern  of  invoked 
function  block  instances  on  object  code  needs  to  be  verified.  For  this  purpose 
we  subject  the  object  code  loaded  into  the  master  processor  to  back  translation, 
a  safety  licensing  method  [4]  which  was  developed  in  the  course  of  the  Halden 
nuclear  power  plant  project  and  which  is  —  although  rigorous  —  essentially 
informal,  easily  conceivable,  and  immediately  applicable  without  any  training. 
Thus,  it  is  extremely  well  suited  to  be  used  on  the  application  programming 
level  by  people  with  the  most  heterogeneous  educational  backgrounds.  The 
ease  of  understanding  and  use  inherently  fosters  error  free  application  of  the 
method.  It  consists  of  reading  machine  programs  out  of  computer  memory  and 
giving  them  to  a  number  of  teams  working  without  any  mutual  contact.  All  by 
hand,  these  teams  disassemble  and  decompile  the  code,  from  which  they  finally 
try  to  regain  the  specification.  The  software  is  granted  a  safety  license  if  the 
original specification agrees with the inversely obtained re-specifications. Of
course, in most circumstances the method is extremely cumbersome, time
consuming, and expensive. This is due to the semantic gap between a
specification formulated in terms of user functions and the usual machine
instructions carrying them out. Applying the programming paradigm of basic
function modules, however, the specification is directly mapped onto sequences
of module invocations. The object code consists of just these calls and
parameter passing. The implementation details of the function modules are part
of our architecture. Thus, they are invisible from the application programming
point of view and do not require safety licensing in this context. Consequently,
back translation can lead, in one easy step, from machine code back to the
problem specification, which is given in the form of FLDs. For our architecture,
the effort required to utilise the method of back translation is several orders
of magnitude lower than for the von Neumann architecture.
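
Why back translation becomes a single step can be illustrated with a small Python sketch. The encoding of the object code, the module names AND and TIMER, and the signal names are hypothetical and chosen only for illustration; the paper does not give the actual code format.

```python
# Hypothetical object code: each entry is one module invocation
# (module name, input signal names, output signal name).
OBJECT_CODE = [
    ("AND", ("pressure_high", "enabled"), "n1"),
    ("TIMER", ("n1",), "indicator"),
]

# Hypothetical FLD, given as a set of blocks independent of drawing order.
FLD_SPEC = {
    ("TIMER", ("n1",), "indicator"),
    ("AND", ("pressure_high", "enabled"), "n1"),
}


def back_translate(object_code):
    """Recover the block interconnection pattern from the call sequence."""
    return {(name, tuple(inputs), output) for name, inputs, output in object_code}


def license_ok(object_code, spec):
    """Accept only if the recovered diagram equals the specification."""
    return back_translate(object_code) == spec


print(license_ok(OBJECT_CODE, FLD_SPEC))  # True
```

Since every object code entry corresponds to exactly one block of the diagram, the comparison needs no knowledge of how the modules themselves are implemented.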

Back translation is a verification method to be carried out with diverse
redundancy. Originally, this called for different teams of human inspectors.
Since in the case considered here there is only one rather simple inverse
analysis step, we are optimistic that the licensing authorities will eventually
accept the following procedure. Verification by inverse analysis is carried out
by a number of different programs, which should be proven in practice but do not
need to be formally verified. Such programs are to yield graphical outputs. An
official licensor performs the inverse documentation as well, compares his
results with the ones of the verification programs and with the original
graphical application program under inspection and, upon coincidence, issues a
safety license.

Such  a  procedure  is  in  line  with  the  dependability  requirements  for  diversely 
redundant  programs  demanded  by  the  licensing  authorities  and  necessitates 
only  the  minimum  of  highly  expensive  human  involvement,  viz.,  one  licensor, 
who  is  always  indispensable  to  take  the  legal  responsibility  for  issuing  a  safety 
license. 

In order to prevent any modification by a malfunction, in our safety oriented
architecture all programs must be provided in ROMs. For practical reasons,
generally there are two types of these memories. The code of the basic function
modules resides in mask programmed ROMs. On the other hand, the code
representing FLDs is written into (E)PROMs by the user. This part of the
software is subject to project specific verification to be performed by the
licensing authorities, which finally still need to install and seal the
(E)PROMs in the target PLCs.

6  Conclusion 

In our society there is a growing concern for safety (which goes hand in hand
with the increasing awareness for the environment). This has important
consequences for the assessment of program controlled systems. One has begun to
realise the inherent safety problems associated with software. It appears
unrealistic to abandon the use of computers for safety critical control
purposes; on the contrary, there is no doubt that their utilisation in such
applications is going to increase considerably. The problem of software
dependability is therefore becoming more acute.

In a constructive way, and using presently available methods and hardware
technology only, this paper for the first time defines an architecture which
enables the safety licensing of a complete programmable electronic system
including the software. The measures to achieve this objective were:

•  using hardware as much as possible, but not necessarily in the most
(hardware-)cost efficient way, since now there is cheap hardware in abundance
(the additional hardware costs are equivalent to the cost of a software
engineer for about half a day),

•  utilisation of a high level, graphical software engineering method,

•  closing of the semantic gap between architecture and user programming by
basing the software development on a set of function blocks with application
specific semantics,

•  removal of compilers from the chain of items requiring safety licensing,

•  avoiding the need for a complex operating system, and

•  providing a feasible application level and architectural support for the
software licensing method of back translation.

Employing VIPER microprocessors, we have built a prototype of the PLC
architecture described. Its utilisation in practice showed that implementing the
functionality of a hard wired ESD system with our PLC architecture is feasible,
and that the programming paradigm based on formally verified function modules
can yield error free software. We hope that the concept presented here will lead
to the replacement of hard wired systems safeguarding industrial processes by
programmable ones executing safety licensed and, thus, highly dependable
software.


Abstract

This paper starts with a general description of the AEG Transportation
Systems, Inc. Automatic People Mover System. Subsequently, the specific
safety requirements of the Automatic Train Protection (ATP) system and the
consequent design features to meet these requirements are described.
Following this introduction, details are given of the relationship between
designer and certifier, the utilization of embedded rules-based systems,
the concurrence of the design and certification process, and the
de-coupling of the safety functions from the hardware. It is described how
these make possible dramatic reductions in the large costs and long
schedules traditionally associated with both the design and certification
of safe computer systems.


1  General  Description  of  the  Automatic  People  Mover 

1.1  Overview 

The AEG Automatic People Mover System consists of driverless trains which
usually run on a guideway with a concrete surface. The trains are guided by an
I-Beam and may be configured as consists of one, two or more vehicles. Passenger
access to the guideway is prevented by automatically operated station doors,
which are normally closed. They open in synchronism with the corresponding
vehicle doors for passenger exchange only when the train is in the station and
the doors on both sides (vehicle and station) are properly aligned. The
operation of the Automatic People Mover System is controlled and supervised by
the Automatic Train Control System (ATCS). At Frankfurt Airport the headway is
designed to be down to 90 seconds. The Frankfurt Airport Passenger Transfer
System will be the first application of the new Automatic Train Protection
system design which is described in this paper.

1.2 Automatic Train Control System

1.2.1  General  Overview 

The Automatic Train Control System consists of three major computer-based
subcomponents with different functions. The Automatic Train Protection (ATP)
ensures the safety of operation such as the safety of moving trains or passenger
exchange in stations. The ATP does not allow unsafe system states. The Automatic
Train Operation (ATO) provides operational control of train speeds, programmed
station stopping, station and vehicle door operation, as well as passenger audio
and visual information. The third function, the Automatic Train Supervision
(ATS), is responsible for the supervision of the Automatic People Mover systems,
route and headway control, and reporting of alarms.

The Automatic Train Protection system consists of two major sub-systems, the
Wayside ATP and the Vehicle ATP. Each is based on vital dual channel
cross-checked computers and other vital I/O hardware. The safety functions of
these two subsystems satisfy specific safety requirements on the ATP as
identified in the Safety Requirements Catalog (see chapter 2 Safety
Requirements). Below, the allocation of safety functions to either Wayside or
Vehicle ATP is given.

1.2.2  Wayside  ATP 

The  Wayside  ATP  fulfils  the  following  major  safety  functions: 

•  Detection  of  Trains 

•  Provision  of  Safe  Speed  Codes  (Selection  and  Transmission) 

•  Safe  Switch  Operation 

•  Safe Station Door Operation

•  Vital  Inputs  and  Indications  for  the  Central  Control  Operator 

•  Protection  of  Maintenance  Vehicle 

1.2.3  Vehicle  ATP 

The Vehicle ATP serves, among others, the following purposes:

•  Reception of Speed Commands

•  Supervision of Actual Train Speed

•  Supervision of Safe Travel Direction

•  Safe Vehicle Door Operation

•  Safe Station Stopping

•  Safe Reaction on Unintentional Train Separation


2  Safety Requirements

2.1  Analyses  of  Potential  Hazards  and  their  Causes 

All potential hazards and their causes are identified as comprehensively as
possible by means of a Safety Analysis. In the early phases, this analysis
comprises the High Level Fault Tree Analysis (HLFTA) and the Preliminary Hazards
Analysis (PHA). Hazards such as collisions with switches or other trains,
end-of-line run-through, passengers gaining access to the guideway, or overspeed
in curves are considered carefully in the HLFTA and the PHA. These analyses
provide the designers of the ATP with very detailed information on the possible
causes of potential hazards; the designers, in turn, provide all necessary means
and measures to protect against these hazards. The necessity of these means and
measures is documented in the Safety Requirements Catalog (SFRC). This document
states all requirements for safety functions which need to be fulfilled by the
ATP in order to ensure safe operation of the Automatic People Mover System. The
Safety Requirements Catalog is the basis for all further ATP-related development
and certification steps.

Correcting  the  problems  early  in  the  design  and  development  process  usually  is 
much  less  expensive  than  fixing  them  later  in  the  process.  As  the  hazards  are 
identified  very  early  in  the  design  and  development  process  by  the  above-mentioned 
Safety  Analysis,  the  requirements  are  determined  very  early  and  hence  additional 
costs  in  order  to  fix  problems  later  are  minimized. 

2.2  Results  of  Safety  Analysis  and  Consequent  Design  Principles 

Most of the potential hazards identified during the safety analysis might lead
to injury or death of persons or damage to equipment. Generally speaking, the
worse the potential consequences of a hazard are, the more rigorous the measures
to avoid it must be.

One  might  have  the  idea  to  differentiate  between  each  safety  relevant  function 
according  to  its  potential  damages  in  order  to  engage  different  means  and  measures 
to  protect  against  the  respective  risk.  But  many  of  the  functions  are  implemented  by 
the  use  of  software.  This  makes  it  difficult  to  ensure  that  functions  of  a  lower 
integrity  level  do  not  have  any  unsafe  impact  on  those  functions  which  protect 
against  higher  risks. 

Due  to  these  difficulties,  each  function  of  the  ATP  is  considered  to  be  of  high 
safety  relevance  and  hence  is  designed  in  a  vital  fashion.  Functions  which  have  no 
safety  relevance  are  allocated  to  the  Automatic  Train  Operation  System  (ATO).  For 
instance,  while  the  ATO  controls  the  doors  such  that  station  and  vehicle  doors  open 
synchronously,  the  ATP  ensures  safety  by  not  allowing  the  vehicle  to  move  if  either 
the  station  or  the  vehicle  doors  are  not  closed  and  locked.  Both  computers  are 
separated such that the non-vital ATO cannot interfere with the vital ATP. This
strict separation of vital and non-vital functions minimizes the verification,
validation and certification effort necessary for the vital ATP and for the
entire Automatic Train Control System.

The fundamental regulations for the certification of the Frankfurt Airport
Automatic People Mover are BOStrab [1] and DIN VDE 0831 [2]. As far as
DIN VDE 0831, the German standard for electrical railway signalling systems, is
applicable, all ATP functions are designed to be signal-safe. In other cases,
where neither BOStrab nor DIN VDE 0831 is immediately applicable (software, for
instance), other German or international standards which represent the current
state of the technology are applied, as required by BOStrab. For instance,
DIN V 19250 [3], DIN V VDE 0801 [4] or MU 8004 [5] are applied. According to
DIN V 19250 [3], the ATP is categorized as Anwendungsklasse (Integrity Level) 7.
Consequently, applicable recommendations of DIN V VDE 0801 are followed.

All the components of the Passenger Transfer System (PTS) are designed with an
inherent high level of reliability so as to deliver an overall system
availability of 99.65% or higher. This level, demonstrated quantitatively, is
established by contract and consistent with actual levels reached in numerous
operating installations of the AEG people movers.

2.3 Design of Vital Hardware - Examples

The safety of a computer-based train control system depends directly on the
reliability of the underlying hardware. The hardware must fulfil its specified
functions correctly and safely. Failures must not lead to unsafe states [2].
Hence, failures of hardware components need to be considered carefully.

Some failures can be excluded by the application of vital design properties.
Other failures cannot be excluded and need to be detected to ensure that the
system goes to a safe state in case of such failures. The following sections
describe these vital design principles by means of examples taken from the
Automatic Train Protection system.

2.3.1  Dual  Channel  Cross-Checked  ATP  Computers 

The ATP computers which have to ensure the safety of operation are designed as
dual channel cross-checked computers. Both computers cross-check each other
continually. Each single computer channel compares the inputs and outputs to and
from the actual process with the equivalent data from the other channel before
it allows the transition from one safe state to another state. If one computer
detects a mismatch of cross-checked data, appropriate actions are taken
immediately to transfer the system to a safe state, which is the shut-down of
the concerned guideway portion. All vehicles in this section will come to a stop
and any further movement is prohibited by the ATP until personnel have fixed the
problem and the ATP has checked and confirmed that safe operation is restored.
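
The cross-check can be sketched as follows in Python. This is an illustrative simplification under stated assumptions, not the ATP design: both comparisons are collapsed into one function instead of being performed by each channel, and the names SAFE_STATE, may_move and the door signals are hypothetical.

```python
SAFE_STATE = "SHUTDOWN"  # hypothetical marker for the shut-down of the guideway portion


def cross_check(inputs_a, inputs_b, logic_a, logic_b):
    """Return the agreed output, or SAFE_STATE on any mismatch between channels."""
    if inputs_a != inputs_b:
        return SAFE_STATE            # the channels saw different process inputs
    out_a, out_b = logic_a(inputs_a), logic_b(inputs_b)
    if out_a != out_b:
        return SAFE_STATE            # the channels computed different outputs
    return out_a


def may_move(inputs):
    """Example vital function: movement only with all doors closed and locked."""
    return inputs["vehicle_doors_locked"] and inputs["station_doors_locked"]


doors = {"vehicle_doors_locked": True, "station_doors_locked": True}
print(cross_check(doors, dict(doors), may_move, may_move))  # True
```

An output reaches the process only when both channels agree on what they observed and on what they computed; any disagreement is treated as a failure rather than resolved by preferring one channel.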

2.3.2  Occupancy  Detection  and  Speed  Code  Distribution  (TX/RX  System) 

The Occupancy Detection of the Automatic People Mover System is based on Track
Circuits. A Track Circuit is marked as occupied when it is short-circuited by
the multiple shunts on each vehicle. The Wayside ATP sends the appropriate speed
commands to each vehicle via the Track Circuits, and the Transmit/Receive System
is used to transmit the speed codes to the Track Circuits. Each Track Circuit is
fed with speed commands even when there is no train currently occupying it. If
there is no train, the sent speed codes are read back by the Wayside ATP. They
are then compared against the transmitted codes. If they do not match, a failure
in the Transmit/Receive System is assumed and a shutdown is initiated. This
ensures a periodic test and failure detection of the Wayside ATP components
engaged in the Speed Code Distribution and Occupancy Detection. Here, the
fulfilment of Failure Detection Period Requirements is designed into the system.
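
The read-back test reduces to a simple comparison, sketched here in Python. The function name, the integer speed codes and the return strings are hypothetical, and skipping the comparison for an occupied circuit is an assumption drawn from the statement that codes are read back when no train is present.

```python
def check_track_circuit(sent_code, read_back_code, occupied):
    """Return "OK" or "SHUTDOWN" for one Track Circuit in one test period."""
    if occupied:
        return "OK"  # assumption: no read-back comparison while a train shunts the circuit
    return "OK" if read_back_code == sent_code else "SHUTDOWN"


print(check_track_circuit(sent_code=5, read_back_code=5, occupied=False))  # OK
print(check_track_circuit(sent_code=5, read_back_code=3, occupied=False))  # SHUTDOWN
```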

2.3.3  Failure  Detection  Periods 

Other  components  may  not  be  guaranteed  to  be  tested  within  their  specific  failure 
detection  period  by  the  above-mentioned  approach.  Here,  other  measures  need  to  be 
taken  to  ensure  the  necessary  periodic  tests. 

One approach to test these components, for instance, is that the ATP performs
periodic checks of their operation. Since the respective outputs are verified by
a read-back, the component can be considered tested if it is operated and no
failures are encountered. If that is not the case, the ATP raises an alarm to
the Central Control Operator so that appropriate action can be taken.

2.3.4  Failure  Exclusions  and  Fail-Safe  Design 

In other cases, failures are excluded by the use of special design properties,
such as German signal relays and so-called gravity drop-out relays. Here, the
correct function of the relay is ensured by the chosen mechanical design,
weights and gravity. Other components are designed in such a way that all
failures which are to be assumed will always lead to a safe state.

2.4 Development of Safety Relevant Software

The safety relevant ATP software is developed using the high-level language
Pascal. The compiler used to generate the object code is validated and proven.
It supports a module concept. Hence, the ATP software is designed according to
the industry-accepted design principles of Information Hiding, Data Abstraction,
High Cohesion, and Low Coupling. This ensures low complexity and highly
comprehensible modules. Consequently, generation and verification of the source
code is less error-prone. Additionally, these design principles improve
maintainability, which in turn makes changes less error-prone and reduces
effort. The verification and validation of the ATP software involves rigorous
methods which are described in Section 4.4.


3  Rules-Based  Interlocking  Engine 

Conventional interlocking systems usually are designed for specific
applications. In the past, different guideway layouts or extensions to existing
systems led to considerable effort to customize and certify each separate
configuration of the wayside train protection system. Furthermore, in case of
extensions to existing systems, this approach led to additional system outages
during the exchange of components. The Rules-Based Interlocking Engine (RBIE)
[6] has been designed to circumvent these problems.

3.1  Overview 

The basic principle of the Rules-Based Interlocking Engine is to make the
implementation of safety functions independent from the underlying hardware and
core software of the ATP. The ATP becomes a generic means to realize safety
functions. This is accomplished in the following way. A description of the
actual system with application-specific information is supplied to the
application-independent Wayside ATP. To achieve this, a description of the
system's layout and other details is created. This description is called the
Guideway Definition File. The Guideway Definition File is placed in the ATP (see
Figure 1). The information in the Guideway Definition File is then transformed
(parsed) into an internal representation suitable for interpretation by the
Rules-Based Interlocking Engine software. During runtime the vital decisions of
the Vital ATP Computer are based on generic interlocking rules. These rules are
evaluated by the Rules-Based Interlocking Engine with regard to the application
specific relations as defined in the Guideway Definition File.


Figure 1: Simplified Architecture of the ATP Computer with focus on the RBIE
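
The separation between a generic rule and application-specific data can be illustrated with a short Python sketch. Everything concrete in it is an assumption made for the example: the paper does not show the Guideway Definition File syntax or any actual interlocking rule, so the dictionary layout, the block names and the rule itself are hypothetical.

```python
# Hypothetical stand-in for a parsed Guideway Definition File:
# each block lists the block that follows it and its allowable speed.
GUIDEWAY = {
    "B1": {"next": "B2", "max_speed": 40},
    "B2": {"next": "B3", "max_speed": 25},
    "B3": {"next": None, "max_speed": 0},   # end of line
}


def speed_code(block, occupied, guideway):
    """Generic rule: zero speed unless the following block exists and is free."""
    nxt = guideway[block]["next"]
    if nxt is None or nxt in occupied:
        return 0
    return guideway[block]["max_speed"]


print(speed_code("B1", occupied={"B3"}, guideway=GUIDEWAY))  # 40
print(speed_code("B2", occupied={"B3"}, guideway=GUIDEWAY))  # 0
```

The rule contains no block names or speeds of its own, so a different layout changes only the data passed in, which mirrors the idea of certifying the engine once and each guideway description separately.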

Because of this strict separation of application-specific information in the
Guideway Definition File from the generic Rules-Based Interlocking Engine
embedded in the Wayside ATP, the generic Rules-Based Interlocking Engine is
certifiable as a type. Once the Rules-Based Interlocking Engine is certified
according to the applicable standards and regulations, the effort for the
development and certification of the interlocking of new systems is reduced to
efforts related to the generation and certification of a valid Guideway
Definition File.

3.2 Guideway Definition File

As mentioned above, the application specific information on the people mover
application which the ATP is intended for is laid down in the Guideway
Definition File in a comprehensible and easy-to-read manner. Among other things,
the Guideway Definition File gives information on the guideway layout,
arrangement of stations, allowable speeds and application-specific obstacles
which might violate the clearance profiles such as facade washers.

3.3 Validation of Guideway Definition File

The Guideway Definition File is a well-structured, easy-to-read plain-text
ASCII file. The language is especially designed to be comprehensible to people
mover experts without any special software knowledge. Hence the validation of
the Guideway Definition File may concentrate on the railroad aspects. Validation
may be supported through the graphical exercise of the information contained in
the Guideway Definition File.

3.4  Advantages 

The approach of the Rules-Based Interlocking Engine leads to improvements with
respect to safety as well as economics. The validation of application-specific design
characteristics is much less error-prone compared to conventional approaches.
Besides the safety aspects this approach is very cost-effective. Once the ATP with
its Rules-Based Interlocking Engine is type certified, the time and effort spent on the
configuration and certification of other ATP applications are reduced dramatically.

4  Project-Accompanying  Safety  Certification 

4.1  Relevant  Regulations  and  Standards 

The first application of the described Automatic Train Protection system is the
Frankfurt Airport Passenger Transfer System (PTS). The system has been designed
according to German regulations such as the BOStrab [1] and DIN VDE 0831 [2].

In cases where these regulations do not provide sufficient detail regarding the
safety requirements, other standards that represent the state of technology are
applied, such as DIN V VDE 0801 [3] and Mü 8004 [5].

4.2  Certification  Process 

The certification process, conducted by the certifier, the Institute for Software,
Electronics and Railroad Technology (ISEB) within TÜV Rheinland Sicherheit und
Umweltschutz GmbH, consists of five major steps which run concurrently with the
manufacturer's development process:

•  Set-up  of  a  Certification  Plan 

•  Review  of  manufacturer's  Quality  Assurance  System 

•  Adaptation  of  the  certifier's  configuration  and  documentation  management 
system  to  the  manufacturer's  documentation  structure 

•  Verification  and  Validation 

•  Safety  Trials 

•  Compilation of a Final Report on the system's safety

In close cooperation with the manufacturer, the Certification Plan defines the kind and
scope of certification steps to be conducted by the certifier, such as reviews, inspections,
audits, analyses, and tests. The depth and thoroughness of the certifier's activities
depend directly on the activities which are planned and performed by the
manufacturer and on the insight and understanding which the certifier gains from
these activities.

Firstly, the basic steps of the certification process are defined in principle.
Concurrent with the certification process these basic steps are refined as the development
phases emerge. This process leads to a common understanding of the set of
relevant items which are subject to certification such as documents, processes, software,
and hardware components. For each relevant item, it is determined which activities
need to be performed by the certifier. The certifier then adjusts his configuration
and documentation management system to the set of relevant documents.
Furthermore, a schedule is agreed upon which defines the dates of the submittal of
certification-relevant items as well as the duration of each certification activity.

The more rigorous and comprehensive the QA measures of the manufacturer are
and the better these measures are documented, the more the certifier may limit his
efforts to minimal measures to gain confidence in the manufacturer's QA measures.
For this reason the manufacturer's intended QA system is reviewed in the early
phases of the certification in close cooperation between the manufacturer and the
certifier. This is done based on appropriate documentation, e.g. Software Quality
Assurance Plans and Software Verification and Validation Plans taken from former
projects. Guidelines for documentation, design methods, coding standards, and
project management as well as the structure of test plans are established in these
early phases. The actual certification activities then consist of the following main
steps which are often similar even in different projects:

•  Review  of  Concept  Descriptions 

•  Validation  of  System  Requirements 

•  Validation,  Verification  and  Certification  of  Hardware  and  Software 

•  Integration  Tests 

•  Safety  Trials 

In each phase of the certification process close contact between certifier and
developers makes it possible to communicate emerging problems as soon as possible
and hence fix them as early as possible.

The following sections describe the methods engaged during the development of
the Automatic Train Protection system. This only provides an impression of the
certification process. A detailed description would have exceeded the boundaries of this
paper.

4.3  Hardware Certification

The hardware certification consists of the review of documentation pertinent to all
ATP hardware components. Among others, Lower Level Fault Tree Analyses,
Environmental Analyses, Descriptions and Specifications, Detailed Hazards Analyses,
FMEA, Worst Case Analyses, Test Plans and Test Results for each component are
inspected and, when necessary, supplemented. Boards and other components are
inspected thoroughly. For instance, the creepage distances are examined carefully.
Some components such as relays are tested thoroughly by the certifier. The
execution of tests of other components by the manufacturer is witnessed.

4.4  Software  Validation  and  Verification 

4.4.1  Software  Life  Cycle  and  QA  System 

The  software  is  developed  by  the  manufacturer  according  to  a  software  life  cycle 
model  consisting  of  a  Concept  Phase,  Requirements  Phase,  Design  Phase, 
Implementation  Phase,  Test  Phase  (Unit  Tests  through  System  Integration  Tests), 
Installation  and  Checkout  Phase,  Operations  and  Maintenance  Phase.  For  each 
phase,  the  Software  Quality  Assurance  Plan  and  Software  Verification  and 
Validation  Plan  exactly  define  the  specific  tasks  of  design,  verification  and 
validation.  The  entity  (development  or  V&V)  responsible  for  performing  a 
particular  task,  as  well  as  the  means  for  documenting  the  task  results,  are  defined.  A 
group of engineers is assigned by the manufacturer to be responsible for all V&V
effort. These engineers are independent from the development engineers. This leads
to a high quality of items submitted to the certifier and consequently less
certification effort.

Besides  the  definition  of  V&V  tasks,  the  manufacturer's  Quality  Assurance 
System  takes  into  account  further  regulations  such  as  Coding  Standards, 
Documentation  Guidelines,  or  Software  Requirements  Specification  Procedures. 
Each  component  of  the  Quality  Assurance  System  is  assessed  by  the  certifier. 

4.4.2  Validation  and  Verification  Methods 

All documents have been validated and verified thoroughly by the certifier with
methods such as walkthroughs, reviews, inspections, and static analysis. The manufacturer
conducts comprehensive tests on units, during software integration, and during
system integration. These tests are witnessed by the certifier. Prior to the actual test
execution, the certifier reviews the test plans generated by the manufacturer's V&V
group in order to ensure that all relevant system states are tested. When necessary
the certifier supplements these test cases. In order to achieve a comprehensive set of
test cases the Cause-Effect Graph Method is used by the manufacturer's V&V group
to define the requirements-based Black Box test cases. Black Box tests are executed
first and the branch coverage achieved by these cause-effect graph based test cases is
measured. If this coverage does not reach 100% branch coverage, additional White
Box tests are conducted. This systematic test approach leads to comprehensive tests
of the ATP software.
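
The idea of deriving requirements-based Black Box test cases from causes and effects can be sketched as follows. This is a simplified illustration, not the manufacturer's procedure: the three causes, the requirement, and the stand-in unit under test are hypothetical, and the sketch enumerates every cause combination, whereas the Cause-Effect Graph Method normally derives a reduced set.

```python
from itertools import product

# Hypothetical causes (input conditions) named in an assumed requirement.
CAUSES = ("route_requested", "track_clear", "obstacle_detected")


def expected_effect(route_requested, track_clear, obstacle_detected):
    """Assumed requirement: grant the route only when it is requested,
    the track is clear, and no obstacle is detected."""
    return route_requested and track_clear and not obstacle_detected


def black_box_cases():
    """Enumerate every cause combination together with its required effect."""
    cases = []
    for values in product((False, True), repeat=len(CAUSES)):
        cases.append((dict(zip(CAUSES, values)), expected_effect(*values)))
    return cases


def implementation(route_requested, track_clear, obstacle_detected):
    """Stand-in for the unit under test."""
    if not route_requested:
        return False
    if obstacle_detected:
        return False
    return track_clear


def run_cases(unit):
    """Return the input combinations for which the unit violates the requirement."""
    return [inputs for inputs, expected in black_box_cases()
            if unit(**inputs) != expected]
```

Branch coverage of `implementation` under these cases would then be measured with a coverage tool, and White Box cases added only for branches left unexercised.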

4.4.3  Type  Certification 

As soon as certain components (sub-components, boards, computers, software
components like the Rules-Based Interlocking Engine) are type certified they may be
regarded as building blocks. The certification of a specific application may then
focus on the correct application of these building blocks and the validation of
application-specific configuration data.


5  Conclusions 

The  chosen  approach  leads  to  improvements  in  costs,  scheduling  and  safety.  These 
improvements  are  made  possible  mainly  by  the  flexibility  and  reusability  of  the 
Rules-Based  Interlocking  Engine,  the  flexible  and  effective  concurrent  certification 
process,  and  the  manufacturer's  development-independent  Quality  Assurance 
activities.  In  turn,  as  shown  above,  the  rigorous  verification  and  validation  measures 
ensure  the  safety  of  the  Automatic  Train  Control  system  with  the  above-mentioned 
significant  cost  improvements. 
1  Introduction

Since the installation of the first mechanical interlocking in 1856, railway signal
engineers have developed a set of rules which define the essential requirements
for safe train movement. In the majority of cases this set of rules can be
expressed as a closed set of boolean equations which, when implemented as
written, yield a safe operating system. The boolean equation set will vary
depending upon the particular requirements of each application. The set of
general rules is imposed on the specific requirements of each application to
yield a closed set of boolean equations which completely describe the safety
and operational requirements of that application.

By implementing the set of equations with hardware elements which have
known failure modes it is possible to create not only a safe operating system
but also a failsafe operating system. For the past 50 years, this has been
accomplished using the ‘safety relay.’ The ‘safety relay’ has a known set of
failure modes. More importantly, it has by design eliminated some failure
modes which are common in general purpose relays. This relay is designed
such that its front contacts will not be closed unless the relay’s coil is properly
energized. While this allows the signal engineer to implement a safe system,
designing a relay to have a particular set of failure characteristics results in
relays which are costly and physically large (a GRS B1 relay is approximately
16 cm x 2.5 cm x 8.6 cm and weighs about 4 kg). Even a simple application
may use in excess of 100 relays and have more than 1000 wire connections.

While relay based systems are very robust, they are also very inflexible.
Changes to the control system require the addition, deletion or modification
of relays and interconnecting wires. This is labor intensive and in many cases
it requires physical space that is not available. The microprocessor on the
other hand offers extreme flexibility, small size, and low cost. However, it also
offers hardware with a set of failure modes which cannot be totally defined,
and, therefore, cannot be used to achieve a failsafe system based on failure
characteristics. Thus, the challenge is to develop an algorithm that will allow
the microprocessor to be used to solve boolean expressions in a way that ensures
that any and all failures which might affect the attainment of correct results
are revealed and used to force the system to a known safe state.

One approach that has been implemented in safety critical applications
is the use of multiple processors checking each other. This approach is
implemented in various ways including: two or more identical processors with
identical software, two or more identical processors with diverse software and
two or more different processors with diverse software. In each case process
results are checked for agreement, and the lack of agreement is used to force the
system to a safe state. Each of these implementation methodologies has areas
which must be thoroughly analyzed before they can be accepted as producing
a failsafe system.
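
The agreement check common to these multi-processor schemes can be sketched in a few lines. This is only an illustration of the principle: the two channel functions, the requirement they compute, and the state names are hypothetical.

```python
SAFE_STATE = "STOP"  # hypothetical name for the known safe state


def channel_a(track_clear, switch_locked):
    """First channel: direct formulation of an assumed requirement."""
    return track_clear and switch_locked


def channel_b(track_clear, switch_locked):
    """Second channel: diverse (De Morgan) formulation of the same requirement."""
    return not (not track_clear or not switch_locked)


def checked_output(channels, track_clear, switch_locked):
    """Run every channel and force the safe state on any disagreement."""
    results = [channel(track_clear, switch_locked) for channel in channels]
    if any(result != results[0] for result in results):
        return SAFE_STATE
    return "PROCEED" if results[0] else SAFE_STATE
```

A fault in one channel is revealed by the comparison, but `checked_output([faulty, faulty], ...)` with the same flawed function in both channels agrees with itself and passes unnoticed, which is the weakness of identical software.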

The identical processors using identical software approach requires that the
initial software be proven to be completely error free. This technique is only
effective in revealing independent hardware failures which cause the checked
results to differ. It does not protect against software design errors. If both
systems have the same embedded flaw then they will both act on it the same way and
the flaw will not be revealed. Proving that the software contains no embedded
flaws or is error free is difficult if not impossible [1].

The  use  of  identical  processors  with  diverse  software  requires  proof  that  the 
software  is  actually  diverse.  If  the  software  is  written  to  the  same  specification 
for  the  same  processor,  using  the  same  command  set  the  level  of  diversity  of 
the  final  set  of  software  appears  questionable.  Also,  it  appears  to  be  difficult 
to  prove  that  the  software  is  actually  diverse  enough  to  reveal  all  embedded 
errors  and/or  hardware  failures. 

The approach of using diverse hardware and diverse software would appear
to solve the problems of these other approaches, but it actually requires two
complete developments, two complete sets of hardware and two complete sets
of installation tests. This is costly and hardware dependent. Also, in all of
these cases the voting or checking algorithm must be developed and analyzed
to prove that it provides the safe operation required.

These approaches, while not perfect, do offer the possibility that the
microprocessor can be applied to safety critical systems. However, they suffer from
serious practical problems related to the cumbersome analysis that must be
done to validate the level of safety they provide in each installation. Much of
the analysis must be repeated, in detail, for each system that is installed.

In order to use microprocessors in the rail industry we determined [2] that
the solution must not only resolve the safety problem but must also produce a
system that is: 1) analyzable and verifiable to some minimum level of safety, 2)
hardware independent, 3) application independent (such that it does not have
to be safety verified for each new application), 4) easy to apply to different
sets of boolean equations, 5) not based on proving the software to be error
free and 6) cost effective. The developed solution has been named Numerically
Integrated Safety Assurance Logic (NISAL) [3]. This solution is effective in
meeting the stated goals and, because it is numerically based, it allows the
upper bound of the probability of an unsafe event to be calculated [4]. This
then allows the lower bound of the mean time between unsafe events (MTBUE)
to be calculated. These calculations can be done once because they are an
intrinsic part of the system design. They do not need to be redone for each
application.


2  The  Algorithm 

The  basic  structure  of  a  system  that  uses  numerically  integrated  logic  is  shown 
in  Figure  1.  A  set  of  sensors  which  have  relay  contacts  that  are  either  open  or 
closed  provides  information  on  the  state  of  the  railroad  system.  These  sensors 
are  probed  by  the  system  to  determine  their  values,  and  the  TRUE  or  FALSE 
state  is  used  in  primordially  safe  boolean  expressions  to  determine  the  proper 
settings  for  the  output  devices. 


[Figure: blocks labeled Sensors, Evaluation, and Output Devices, with two paths labeled check.]

Figure 1: Simplified structure of the system.

The safety of the process is assured by causing it to generate checkwords
that are used in a validation step by an independent agent, described below.
Each process step - evaluating a boolean expression, verifying the state of an
input port, verifying the state of an output port setting, verifying that a
section of computer memory was cleared, etc. - generates a checkword, and each
checkword must satisfy the independent agent. In current implementations the
checkwords are thirty-two bit binary numbers. The operations are so structured
that the system generates each checkword by correctly doing the associated
operation. If the operation is not done correctly then the checkword is effectively
a random pattern with only a small chance of being correct. Thus, incorrect
checkwords reveal improperly done operations.

The independent agent consumes checkwords that are produced by the
system and provides the energy to operate the output devices. If the agent is not
satisfied with the sequence of checkwords it is receiving, it removes energy from
the system and causes it to shut down in a known safe state.

Neither the system nor the agent “know” the correct checkwords in the
sense that they are stored in memory to be looked up. This would lead to the
possibility of improper operation being unrevealed due to the correct checkword
sequence being obtained from memory. As we have seen, the system
must generate them by proper operations. Similarly, the agent responds very
selectively to its input binary pattern. If the pattern is not correct then the
response will not support the continued flow of system energy. The response
characteristics are determined by the structure of the agent, which is matched
to the proper checkword pattern in a manner similar to the correspondence of
a matched filter to a particular signal.

The  independent  agent  has  two  possible  responses:  (1)  provide  vital  power, 
and  (2)  remove  vital  power.  Providing  vital  power  is  a  positive  action  that 
depends  upon  the  correct  input  sequence.  If  the  sequence  is  not  correct,  energy 
is  removed  from  the  system  and  it  shuts  down  safely.  The  operation  of  providing 
or  removing  vital  power  can  be  accomplished  in  a  number  of  ways,  including 
the  use  of  a  vital  power  supply  or  a  vital  relay. 

To illustrate the nature of the checking process, consider the simplest
possible situation - one in which there is a single expression to be evaluated.
Suppose that the states of input devices are represented by A, B, etc. and that
the logical value of the expression is represented by x as in the equation

$$x = ABC + DE + FGHI + \cdots \tag{1}$$

x will have the value TRUE if and only if all of the variables in at least one of
the product terms are TRUE. (Values which should be false for the equation to
evaluate to TRUE would be complemented in the equation.)

Suppose  that  the  TRUE  and  FALSE  values  of  x  are  represented  by  N-bit 
binary  patterns,  and  that  the  TRUE  pattern  is  unknown  by  the  evaluator.  The 
proper  pattern  is  produced  by  scanning  the  expression  until  the  first  product 
term  that  has  proper  values  for  a  TRUE  result  is  found.  The  parameters  in  that 
term  are  processed  by  the  evaluator,  in  a  manner  described  below,  to  produce 
an  N-bit  pattern  for  x.  If  no  product  term  is  found  that  has  proper  values  for  a 
TRUE  result,  then  the  N-bit  FALSE  pattern  is  inserted  for  x.  The  system  logic 
is  designed  so  that  no  FALSE  result  can  cause  an  unsafe  condition. 

A large system will have many boolean expressions, but the principle is the
same. Each expression is processed to produce a binary pattern, which is used
as a checkword. By making the binary patterns sufficiently long, it is highly
improbable that they can be produced by chance. It is even more unlikely that
a system with a hidden failure will produce a correct sequence of checkwords
by random selection.

The input sensors are read by transmitting particular digital patterns to the
input test circuits and observing the returned patterns. A particular sensor is
taken to have a TRUE reading only if the returned pattern is a codeword in
a suitable error detecting code. This, however, does not produce the vital
input sensor value. It only establishes a hypothesis about the setting of the
input port. Any TRUE setting is then verified by transmitting another pattern
through the port and using that pattern in the computations.

The settings of output devices are verified in a similar fashion. The states of
the devices are set by sending commands based on the results of the boolean
evaluations. The output states are then verified by transmitting state-dependent
patterns through the devices and using those results together with the intended
settings to produce the signal for the independent agent.

To  guard  against  the  possibility  that  some  internal  system  failure  could  be 
unrevealed  by  a  particular  checking  algorithm,  the  system  may  be  designed  to 
do  a  second  set  of  checking  operations  that  are  informationally  redundant  but 
algorithmically  diverse  from  the  first  set.  The  current  implementation  produces 
a  pair  of  checkword  results  for  each  operation,  say  x  and  x',  each  created  by 
a  separate  checkword  algorithm  implemented  as  a  finite-state  machine  with  a 
unique  system  matrix. 

2.1  Structure  of  the  Evaluation  System 

There are many ways to construct an evaluator which will produce a given
output pattern from a sequence of input patterns. A basic form is a finite-state
machine, which is started in an initial state and then stepped through
a state sequence. Each parameter in a boolean product term is represented
by a binary pattern, and those patterns serve as the machine inputs. Each
pattern in the input sequence influences the next evaluator state. The final
state is completely predictable from the initial state and the input sequence.
The evaluator output, which is a function of the final system state, will be a
proper binary pattern if it begins in the correct initial state and all of the input
parameter values have their expected patterns.

A  table  of  initial  conditions,  one  for  each  product  term  of  each  boolean 
expression,  is  maintained.  When  a  particular  product  term  is  to  be  evaluated, 
the  initial  condition  for  that  product  term  is  taken  from  memory  and  used  as 
the  initial  state  vector  of  the  finite-state  machine.  The  machine  is  then  cycled 
with  the  parameter  value  patterns  as  inputs.  When  all  of  the  parameters  have 
been  consumed  the  output  is  calculated  as  a  function  of  the  final  state  of  the 
machine. An independent evaluation of each boolean expression is done with
a second evaluator using parameters from the second input channel. The two
results, x and x', form the diverse pair of binary patterns.

The state of a finite-state machine at a particular time t + 1 is determined
by its state at time t and the input at time t. If the inputs are represented by
the sequence $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$ and if the states are represented by the
sequence $\mathcal{V} = \{v_1, v_2, \ldots, v_n\}$, then

$$v_{t+1} = v_t T + u_t U \tag{2}$$

where T is the system matrix and U is the input matrix. If the system has
r internal states and s inputs, then T is r x r and U is s x r. The system
behavior is determined almost completely by the properties of T.
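
A linear machine of the form (2) can be sketched compactly when states are held as r-bit integers and addition is bitwise exclusive-or. The sketch below rests on assumptions that are not in the text: T is taken to be the shift of a feedback shift register with feedback polynomial x^8 + x^4 + x^3 + x^2 + 1, the state has r = 8 bits rather than 32, and U is taken to be the identity.

```python
R = 8          # number of state bits; the implementation described uses 32
POLY = 0x11D   # x^8 + x^4 + x^3 + x^2 + 1 (assumed feedback polynomial)


def apply_T(v: int) -> int:
    """Multiply the state vector by T: one shift of the feedback register."""
    v <<= 1
    if v & (1 << R):
        v ^= POLY
    return v


def step(v: int, u: int) -> int:
    """One transition v_{t+1} = v_t T + u_t U, with U assumed to be the identity."""
    return apply_T(v) ^ u


def run(v1: int, inputs: list[int]) -> int:
    """Return the final state v_{n+1} reached from v1 after consuming all inputs."""
    v = v1
    for u in inputs:
        v = step(v, u)
    return v
```

Because both the shift and the exclusive-or are linear over the binary field, `step(a ^ b, u ^ w)` equals `step(a, u) ^ step(b, w)`, which is the property the error analysis relies on.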

If the system is initialized to be in state $v_1$, then the state $v_{t+1}$ can be
shown to be

$$v_{t+1} = v_1 T^t + \sum_{i=1}^{t} u_i U T^{t-i} \tag{3}$$

The state $v_{t+1}$ is a function only of the initial state $v_1$ and the sequence of
inputs $u_1, \ldots, u_t$.
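
Unrolling the recurrence (2) for the first few steps shows where the closed form comes from. The expansion below is a worked illustration added for clarity and uses only the symbols already defined.

```latex
\begin{align*}
v_2 &= v_1 T + u_1 U,\\
v_3 &= v_2 T + u_2 U = v_1 T^2 + u_1 U T + u_2 U,\\
v_4 &= v_3 T + u_3 U = v_1 T^3 + u_1 U T^2 + u_2 U T + u_3 U.
\end{align*}
```

Each input $u_i$ is multiplied by $U$ once and then by $T$ once for every later step, which gives the factor $U T^{t-i}$ in the sum.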

The checkword for the evaluation of a boolean expression is computed by
(1) finding the first product term that has proper parameter values to produce
a TRUE result (or directly producing the FALSE checkword as the value for x
if there is no such term); (2) selecting an initialization value for that product
term (the value of $v_1$); (3) setting the initial value of the machine; (4) operating
the machine with parameter t of the product term as input $u_t$; (5) after the n parameters
of that product term have all been entered, using $v_{n+1}$ as the system output.
Thus, the output is determined by the initial state and the input sequence.
For the algorithm to be useful for our purposes it must be possible to compute
the initial condition to produce the target output value for each parameter set
and it must be the case that an error in any bit of the initial condition or any
parameter will cause the system to produce a different output pattern.

It is a relatively simple matter to compute the initial condition required
to produce a desired target output $\mathbf{t}$ for a given input sequence $\mathcal{U}$. In (3) let
$t = n$ and set $v_{n+1} = \mathbf{t}$. Then sequentially step backwards through the states
by doing the calculation

$$v_t = (v_{t+1} + u_t U) T^{-1} \tag{4}$$

Multiplying by $T^{-1}$ is equivalent to stepping the machine backward. After n
steps we reach the desired value for the initial condition, $v_1$.
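
The evaluation procedure and the backward computation of initial conditions can be sketched together. The sketch is illustrative only and makes several assumptions not found in the text: an 8-bit state, the feedback polynomial x^8 + x^4 + x^3 + x^2 + 1 for T, U equal to the identity, and hypothetical values for the checkwords, the parameter patterns, and the expression x = AB + CD.

```python
R = 8
POLY = 0x11D        # assumed feedback polynomial x^8 + x^4 + x^3 + x^2 + 1
TRUE_WORD = 0xA5    # hypothetical target checkword for x = TRUE
FALSE_WORD = 0x3C   # hypothetical checkword for x = FALSE


def apply_T(v):
    """Multiply by T: step the register forward."""
    v <<= 1
    if v & (1 << R):
        v ^= POLY
    return v


def apply_T_inv(v):
    """Multiply by the inverse of T: step the register backward."""
    if v & 1:
        v ^= POLY
    return v >> 1


def run(v1, inputs):
    """Forward operation: apply v_{t+1} = v_t T + u_t for every input."""
    v = v1
    for u in inputs:
        v = apply_T(v) ^ u
    return v


def initial_condition(target, inputs):
    """Backward operation: v_t = (v_{t+1} + u_t) T^-1, from the target to v_1."""
    v = target
    for u in reversed(inputs):
        v = apply_T_inv(v ^ u)
    return v


def evaluate(terms, init_table, readings):
    """Scan for the first TRUE product term and run the machine on its patterns.

    readings maps a parameter name to the pattern obtained for it, or to None
    when that parameter reads FALSE.
    """
    for term, v1 in zip(terms, init_table):
        patterns = [readings.get(name) for name in term]
        if all(p is not None for p in patterns):
            return run(v1, patterns)
    return FALSE_WORD


# Hypothetical parameter patterns and product terms for x = AB + CD.
PATTERNS = {"A": 0x1B, "B": 0x2D, "C": 0x47, "D": 0x8E}
TERMS = [("A", "B"), ("C", "D")]
# Computed ahead of time: one initial condition per product term.
INIT_TABLE = [initial_condition(TRUE_WORD, [PATTERNS[n] for n in t]) for t in TERMS]
```

With correct patterns either product term yields `TRUE_WORD`, while a single flipped bit in a pattern or in an initial condition yields some other word, so the target value is never stored where the evaluator could simply look it up.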

The effect of an error in the initial condition or any of the inputs is related
to the dynamics of the system when it is allowed to operate with zero input.
Because the system described by (2) is linear, the effects of individual errors
can be found by superposition. The effect of an error in the initial condition
$v_1$ can be seen by allowing the initial condition to be

$$\hat{v}_1 = v_1 + v_1^e \tag{5}$$

where $v_1$ is the intended value, $v_1^e$ is the error, and $\hat{v}_1$ is the initial condition
actually used. The final state from this state is

$$\hat{v}_{n+1} = \hat{v}_1 T^n + \sum_{i=1}^{n} u_i U T^{n-i} \tag{6}$$

The difference between the actual final state and the intended final state can
be found by subtracting (modulo 2 addition) (3) with $t = n$ from (6).

$$e_{ic} = \hat{v}_{n+1} + v_{n+1} = v_1^e T^n \tag{7}$$

The above result shows that the effect of an error in the initial condition depends only on
the error value $v_1^e$ and the system matrix T. It does not depend upon the
intended initial condition or any input value.

Similarly, suppose that there is an error in a particular input value, say $u_m$,
$1 \le m \le n$. Then $\hat{u}_m = u_m + u_m^e$ replaces $u_m$ in the computation of the final
state. The result is

$$\hat{v}_{n+1} = v_1 T^n + \sum_{i=1}^{n} u_i U T^{n-i} + u_m^e U T^{n-m} \tag{8}$$

The effect of the input error is given by the last term in the expression.

$$e_m = u_m^e U T^{n-m} \tag{9}$$

Once again, the effect of the error does not depend upon the desired initial
condition or the desired input values. It is simply a product of the input error,
the input matrix, and the system matrix raised to the power $n - m$. The same
result would be achieved by starting the system in state $u_m^e U$ and stepping
it $n - m$ times with no other input. The mechanism for an individual error
affecting the output is the system transient response, as determined by powers
of T.

Errors in other input parameters can be treated similarly. The total error
in the final state is given by

$$e_t = v_1^e T^n + \sum_{m=1}^{n} u_m^e U T^{n-m} \tag{10}$$

Each error causes a system transient response to be reproduced. The total
error is the sum of the error transients set up by the individual errors. This
observation allows us to investigate the effects of errors on the system output
by examining the transient response, or the autonomous system response.
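
The superposition result can be checked numerically on a small machine. The sketch below is an illustration under assumptions that are not in the text: an 8-bit state, the feedback polynomial x^8 + x^4 + x^3 + x^2 + 1 for T, and U equal to the identity.

```python
R = 8
POLY = 0x11D  # assumed feedback polynomial x^8 + x^4 + x^3 + x^2 + 1


def apply_T(v):
    """Multiply by T: one shift of the feedback register."""
    v <<= 1
    if v & (1 << R):
        v ^= POLY
    return v


def T_pow(v, k):
    """Multiply by T^k: step the autonomous system k times from state v."""
    for _ in range(k):
        v = apply_T(v)
    return v


def run(v1, inputs):
    """Final state reached from v1 with v_{t+1} = v_t T + u_t."""
    v = v1
    for u in inputs:
        v = apply_T(v) ^ u
    return v


def predicted_error(v1_err, input_errs):
    """Total error in the final state as the sum of the individual transients."""
    n = len(input_errs)
    total = T_pow(v1_err, n)
    for m, err in enumerate(input_errs, start=1):
        total ^= T_pow(err, n - m)
    return total


def actual_error(v1, inputs, v1_err, input_errs):
    """Difference between the corrupted run and the intended run."""
    corrupted = [u ^ e for u, e in zip(inputs, input_errs)]
    return run(v1 ^ v1_err, corrupted) ^ run(v1, inputs)
```

For any choice of intended values, `actual_error` agrees with `predicted_error`, which depends only on the error values and on when they occur.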

2.2  Autonomous  System  Response 

An autonomous system has all inputs set to zero. If it is started in a state $s$ then
the sequence of internal states is $\{s, sT, sT^2, \ldots\}$. The powers of the matrix T
must eventually begin to repeat, causing this to be a repeating state sequence.
The fact that powers of T must repeat is a consequence of the Cayley-Hamilton
theorem, which says that a matrix must satisfy its own characteristic equation.

$$\Phi(X) = |T + XI| \tag{11}$$

The characteristic equation is a polynomial in X. If T is r x r then the highest
power of X in $\Phi(X)$ is $X^r$.

$$\Phi(X) = d_0 + d_1 X + \cdots + d_{r-1} X^{r-1} + d_r X^r \tag{12}$$

where the coefficients are from the binary number field {0, 1} and, in particular,
$d_r = 1$. Since T must satisfy this equation,

$$T^r = d_0 I + d_1 T + \cdots + d_{r-1} T^{r-1} \tag{13}$$

Every power of T can be represented as a polynomial in T with maximum
degree $r - 1$.

$$T^k = b_0 I + b_1 T + \cdots + b_{r-1} T^{r-1} \tag{14}$$

where the coefficients are binary numbers. There are at most $2^r - 1$ distinct powers
of T (excluding T = 0), so that the state sequence has a maximum length of
$2^r - 1$ before repeating. This maximum length will be achieved for linear finite-state
machines in which the characteristic equation is a primitive polynomial
of degree r.

A maximum-length LFS machine of order r can be constructed by building
a linear feedback shift register with feedback connections determined by the
coefficients of a primitive polynomial $\Phi(X)$ of degree r. There is at least
one primitive polynomial of every degree, so this is always possible. A good
discussion of the shift-register implementation of finite-state machines and their
properties is contained in Peterson and Weldon [5].
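
As a small illustration of the maximum-length property, the following Python sketch steps a four-stage register (far smaller than the r = 32 used in the text) and counts the steps before the state repeats. The Galois-style update rule, the function names and the two example polynomials are assumptions chosen for illustration; they are not taken from the implementation described in this paper.

```python
def lfsr_step(state: int, poly: int, r: int) -> int:
    """Advance the register one step: multiply by X modulo poly over GF(2).

    `poly` is a bit mask that includes the X^r term, e.g. 0b10011 for
    X^4 + X + 1. This Galois form is an illustrative assumption.
    """
    state <<= 1
    if (state >> r) & 1:      # a bit was shifted out of the r-bit register
        state ^= poly         # feed it back according to the polynomial
    return state


def period(start: int, poly: int, r: int) -> int:
    """Count the steps until the register returns to its nonzero start state.

    Assumes the constant term of poly is 1, so the update is invertible
    and the start state is always reached again.
    """
    state = lfsr_step(start, poly, r)
    steps = 1
    while state != start:
        state = lfsr_step(state, poly, r)
        steps += 1
    return steps


R = 4
PRIMITIVE = 0b10011       # X^4 + X + 1 (primitive)
NON_PRIMITIVE = 0b11111   # X^4 + X^3 + X^2 + X + 1 (irreducible, not primitive)

print(period(1, PRIMITIVE, R), period(1, NON_PRIMITIVE, R))
```

With the primitive polynomial the register visits all 2^4 - 1 = 15 nonzero states before repeating, whereas the non-primitive polynomial repeats after only 5 steps.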

2.3  Error  Effects 

The independent agent “expects” a checkword sequence $\{z_i\}$. Its design is
such that it will fail to provide system energy if this sequence is incorrect. In
the current implementation each $z_i$ is a thirty-two bit binary number that is
generated by the operation of the LFS machines as described above. To find
the probability that the independent agent will fail to shut the system down
in the event of a computational error, we need to compute the probability of
getting the proper value of $z_i$ with an erroneous evaluation.

The effect of an individual error depends upon both its value and the time of
occurrence. The effects of multiple errors combine by superposition, as shown
in (10). A combination of errors will be invisible only if the effects sum to 0.
This can occur only when there are two or more errors because no single error
can produce the 0 state.
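
The superposition property can be checked on a toy register. The sketch below is a simplified assumption rather than the paper's implementation: it uses a four-stage register with the polynomial X^4 + X + 1, and it injects errors directly into the register state instead of passing them through an input matrix. All names are hypothetical.

```python
POLY, R = 0b10011, 4  # X^4 + X + 1, an illustrative primitive polynomial


def step(state: int) -> int:
    """One linear update over GF(2): multiply by X modulo POLY."""
    state <<= 1
    if (state >> R) & 1:
        state ^= POLY
    return state


def run(start: int, n: int, errors: dict) -> int:
    """Run n steps; `errors` maps a time index to a pattern XORed into the state."""
    state = start
    for t in range(n):
        state ^= errors.get(t, 0)
        state = step(state)
    return state


def propagate(pattern: int, steps: int) -> int:
    """Transient set up by a single error pattern after `steps` updates."""
    for _ in range(steps):
        pattern = step(pattern)
    return pattern


start, n, j, k, e_j = 0b1001, 10, 2, 6, 0b0101
clean = run(start, n, {})

# A single error shifts the final state by its own transient, which is never 0.
single = run(start, n, {j: e_j})
print(clean ^ single == propagate(e_j, n - j), clean ^ single != 0)

# A second error cancels the first only if it equals the first error's
# transient after k - j steps.
cancelling = propagate(e_j, k - j)
print(run(start, n, {j: e_j, k: cancelling}) == clean)
```

Because the update is linear, the difference between the faulty and the clean run depends only on the injected patterns and their timing, not on the starting state.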

As an example, let us look at the requirements of a pair of errors, say at
times t = j and t = k with k > j, such that their combined effect is 0 at some
later time, say t = n. The requirement is

$$u_j^e U T^{n-j} + u_k^e U T^{n-k} = 0 \qquad (15)$$

$$u_k^e U = u_j^e U T^{k-j} \qquad (16)$$

The right-hand side can be any nonzero pattern. In the current system, in
which r = 32, there are $2^r - 1 = 2^{32} - 1 = 4{,}294{,}967{,}295 \approx 4.3 \cdot 10^9$ distinct
patterns. The left-hand side must match the pattern that is chosen by the
right-hand side. Starting at a random initial state, produced for instance by
a random error pattern combining with the shift-register contents at time j,
the system is equally likely to be in any of the nonzero states k - j steps later.
The probability of the error at time k just canceling out the random pattern
caused by the error at time j is one in $2^r - 1$, which can be made as small as
desired by the choice of r. In the current implementation this probability is
about $2.33 \cdot 10^{-10}$.

An equivalent analysis applies to the case of more than two errors. The
chance that several errors will combine to produce a 0 result is approximately
$2^{-r}$.

Recall that the system uses two channels for diversity in error checking.
The probability that both channels produce a 0 response to two or more errors
is the product of the individual probabilities, or about $5.43 \cdot 10^{-20}$. In a system
in which there is an independent error opportunity about once every 100
milliseconds, the expected time between such occurrences (MBTUE) is $5.8 \cdot 10^{10}$
years.
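
The figures quoted above can be reproduced with a few lines of arithmetic. The sketch below only restates the calculation in the text; the variable names and the 365.25-day year are assumptions made for illustration.

```python
r = 32                                   # register length used in the text
p_single = 1 / (2**r - 1)                # one channel masks the error: about 2.33e-10
p_both = p_single ** 2                   # both channels mask it: about 5.42e-20

interval_s = 0.1                         # one error opportunity every 100 ms
seconds_per_year = 365.25 * 24 * 3600    # assumed length of a year

# Expected number of opportunities until an undetected event is 1 / p_both.
mtbue_years = interval_s / p_both / seconds_per_year   # about 5.8e10 years

print(f"{p_single:.3g}  {p_both:.3g}  {mtbue_years:.3g}")
```

The small difference between 5.42e-20 here and the 5.43e-20 in the text comes from squaring the rounded value 2.33e-10.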

3  Applications 

This algorithm has been applied in the Vital Processor Interlocking (VPI),
the Microcabmatic¹ and the Apparato Statico con Calcolatore Vitale (ASCV²)
products. The VPI and ASCV products are used to control signals and switches
at interlockings (locations where multiple tracks are connected together to allow
various train movements) and the Microcabmatic product is used on board
vehicles (locomotives and transit vehicles) to ensure safe speed enforcement.
These products have been applied in more than 350 installations and to date
have achieved over 3,000,000 hours of safe operation. There has not been a
failure of the algorithm to ensure safe operation. Furthermore, the systems
have proven to be reliable and have not suffered from unnecessary shutdowns.
For example, the average VPI system has demonstrated a mean time between
system shutdowns (assuming no systems utilize backup or standby redundancy)
of over 50,000 hours due to hardware failures, including those in the input and
output interface circuitry.

The algorithm has been reviewed and analyzed by rail and/or transit
authorities in at least 6 different countries, and all have found it to be acceptable
for their use. Since the algorithm is not application dependent, this equipment
has found use in other rail applications where Boolean equations are used,
including speed limit selection control and highway crossing gate control. Also,
since the algorithm is application independent it is easily used by countries
whose signalling philosophies differ from that used in the USA.

The power of the microprocessor has allowed these products to include
embedded diagnostics. These diagnostics have been instrumental in system
maintenance and in reducing the time needed to restore a system after a
failure. A computer-aided applications package has also been developed for these
products which allows the application process to be partially automated and
allows the systems to be simulated before they are installed. The computer
applications package and simulator, in conjunction with the flexibility of this
scheme, have been instrumental in applications which progress through multiple
stages as a result of changing track work.

¹VPI and Microcabmatic are registered trademarks of GRS Corporation.

²ASCV is a registered trademark of SASIB Segnalamento Ferroviario.

4  Conclusion 

The  NISAL  algorithm  has  proven  to  be  both  robust  and  flexible.  It  has  provided 
a  system  which  is  analyzable  and  allows  the  maximum  probability  of  an  unsafe 
event  to  be  calculated  independent  of  the  hardware  and  the  application.  It 
is  easily  applied  and  adapted  to  the  different  signalling  philosophies  found  in 
different  countries.  It  does  not  require  a  proof  that  the  software  is  error  free.  Its 
acceptance  worldwide  has  allowed  the  power  and  advantages  of  microprocessors 
to be applied to the rail industry while maintaining high safety standards.

Abstract

The PESANTE¹ project intends to arrive at an integral approach
towards PES² assessment. Elicitation of knowledge on relations
between PES characteristics and functioning is the first step. Here
categorical analysis plays a major role. The results of this phase
will be used to tune a Bayesian inference network. This network
is able to assess PESs given an amount of information on the
PES characteristics. The techniques chosen are able to cope with
heterogeneous and missing data. PESANTE will cover software
and hardware aspects, as well as the human factor. Also, it can
indicate the value of information to be procured next; this makes
sure a balanced assessment is being made.

¹ PESANTE = Programmable Electronic System ANalysis TEchnique

² PES = Programmable Electronic System

1 Introduction

Before installing safety critical programmable electronic systems in industry, one
wants to assess the dependability aspects of the system. This is done with the
purpose of:

- Choosing the right PES in the right application; balance of costs and
performance.

- Balancing the overall dependability of a total system; the dependability
requirements for the PES are related to those of the total system.

- Numerical assessment of plant dependability performance.

The  problem  with  the  available  techniques  is  that  they  tend  to  be  unbalanced,  not 
using  as  much  data  as  would  be  possible,  and  arriving  at  unusable  measures. 

- They are unbalanced. Most techniques only address hardware performance.
Some address software performance and only a few address the human factor.
Experience has shown that all three of these are important to arrive at dependable
systems. Almost no technique addresses the combination of the three.

- They ignore information. Each technique only includes data that fits in its
mathematical framework. As a result, most techniques available at the moment
disregard much information. More specifically, they only include numerical data
on a few aspects of the system’s characteristics.

- They result in unusable measures. Unusability of measures happens in two ways:
too much information on a small part of a system (e.g. MIL-HDBK-217
calculation of hardware failure rate [1]), or information that is hard to handle (e.g.
the number of remaining bugs in a program).

TNO has defined the PESANTE [2] project, in cooperation with SINTEF and
Norsk Hydro, Dow Europe, and Glasgow Caledonian University’s Software
Metrics Laboratory. The aim of PESANTE is to measure dependability
characteristics of programmable electronic systems in the process industry. The
method is to be used for highly reliable PESs in safety applications, resulting in a
balanced, complete and usable assessment.


2  Baseline  of  the  Project 

Dependability  assessment  of  safety  related  systems  is  getting  more  crucial  every 
day.  More  and  more,  PESs  control  and  guard  critical  processes.  Nowadays  the 
insight  has  grown  that  dependability  of  electronics  depends  on  three  factors: 

- Hardware.

- Software.

- Human factor.

All three of these have to be examined in order to arrive at a sound assessment
of a system. The following sections briefly describe the state of the art in these
fields relevant to PESANTE.

2.1 Hardware Dependability

As in all dependability engineering, hardware dependability assessment can be done
qualitatively and quantitatively. The first methods used were essentially qualitative
(dependability by design). Then the quantitative approach emerged, initiated
amongst others by the U.S. Defence organizations: the MIL-HDBK-217 [1]. This
approach has gained much popularity and is used for almost all PESs at the
moment. In spite of its popularity, much has been said against the applicability of
the handbook [3]. Because of this, and the notion that safety has to be an integral
part of the design in highly dependable systems (making things impossible is better
than making them very improbable), the qualitative approach is gaining ground again.

2.2  Software  Dependability 

In software dependability engineering, the most widely applied methods are
quantitative. Most methods use data on debugging and failure times for estimating
failure rates [4]. There is much discussion about the validity of the approach and
its correlation to reality. Case studies have shown good results, but the question
remains whether the methods give valuable clues for the individual case.

2.3 The Human Factor

The notion that human dependability is a major factor in the functioning of all
systems is rapidly gaining ground. Human factors play a role in all stages of a design:
specification, implementation, debugging, maintenance, upgrading, and last but not
least use. The emphasis on quality assurance at all stages is one of the examples
of the effects of this notion. The other way around, the incorporation of data on
the human factor in an assessment of a system is in development. Numerous
publications [5] discuss the matter and try to find solutions to the problem. The
results are promising. It has become clear that without assessment of the human
factor, dependability estimation of a system is incomplete.

2.4  Integral  Approach  Towards  Dependability  Assessment 

All  three  of  the  aspects  mentioned  above  play  a  role  in  dependability  estimation 
of  industrial  electronics.  This  poses  the  following  problems  to  the  assessor: 

- Finding the right balance in the assessment of the three aspects.

- Combining findings on the three aspects into one assessment. The findings may
be of quite different nature. Also, some information on an aspect may be lacking.
Two examples may clarify this: sometimes information has to be discarded
because the tool used cannot handle that kind of information; and sometimes
analysis methods demand data for arriving at an answer, even if the data is not
available. Essentially, most information one can gather on a system is
heterogeneous: having different units, scales, and importance. This places heavy
demands on the assessor and also on his tools.

- Assessing the value of information. If a certain amount of information on a
system is available and one is not yet satisfied with the results: what information
should be gathered next? What information is most valuable to the assessor in
order to improve his judgement? E.g. if one has a lot of information on hardware
dependability, extra information on hardware dependability might not be very
valuable as opposed to information on the human factor. How can an assessor
choose which information he should gather?

This tendency to try to balance can be found everywhere now. It is a type of
thinking towards a goal, rather than thinking from tools [6]. The goal is a valid
assessment of safety performance of a system. PESANTE integrates some of the
suggested approaches and is described in the following paragraphs. Based on the
insights gained in the course of the project, tools will be developed.


3  Technical  Description 

The PESANTE project is a combination of mathematical and knowledge
elicitation techniques. The project is a synthesis of approaches from various
disciplines: traditional software, hardware and human dependability analysis,
categorical analysis, elicitation and Bayesian inference techniques.

To implement the technique, first knowledge elicitation will take place, using
proven techniques such as the Delphi method. This will concern both objective and
subjective information on systems. The information originates from users and all
others familiar with the performance of the PESs. These people are asked
questions illuminating the quantities under study from different viewpoints. The
idea is that these people all have information, but cannot combine this data into
the metrics needed. PESANTE will systematically help them to do that.

Also input to the categorical analysis are the results of standard assessment
techniques such as MIL-HDBK-217 calculations, software dependability measures
such as MTTF estimations, and human factor assessment techniques. The technique
proposed is to be a step beyond these methods: it is to combine all results into
one assessment.

3.1  Categorical  Analysis 

The technique to analyze the information is a categorical analysis [7]. This will
quantitatively reveal correlations between (combinations of) characteristics. If, for
example, a certain test strategy is necessary to realize a certain maintainability, this
correlation will show up. Later these relations will be used the other way around:
to assess the system from its characteristics [8,9].

In PESANTE categorical analysis will be used to elicit relations between
heterogeneous data and dependability aspects. Categorical analysis has been
available for some time, and is mainly used in psychology and sociology, where
complex relationships are common (as well as the need to unravel them). The
technique is used to reveal those relationships.

Characteristic for categorical analysis is its ability to cope with heterogeneous and
missing data. This enables PESANTE to include all sorts of data available on
PESs in safety applications.

3.2 Bayes’ Reasoning

For assessment, the relations between PES characteristics and dependability
aspects as found using categorical analysis will be implemented using Bayes’
reasoning [10]. This tool uses Bayes’ reasoning to draw conclusions on dependability
aspects based on given input. This input does not need to be complete; the tool
will be able to cope with missing data. As with categorical analysis, the technique
used is not new. It is used with success in other applications such as fault diagnosis.
Only a few attempts at using artificial intelligence in the field of dependability
engineering have been made until now. The advantages of using Bayes’ reasoning
are:

- It can cope with missing data.

- It can give estimations of the value of data.

- It can give an impression of the reasoning behind a result: how did the system
come to this conclusion.
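
The first two advantages can be illustrated with a deliberately small Python sketch. It is not the PESANTE tool: it assumes a two-class dependability rating, two binary characteristics that are conditionally independent given the class (a naive Bayes simplification of a Bayesian network), and invented probabilities. All names and numbers are hypothetical.

```python
import math

# Hypothetical prior over the dependability class of a PES.
PRIOR = {"high": 0.5, "low": 0.5}

# Hypothetical P(characteristic is True | class); numbers are invented.
LIKELIHOOD = {
    "formal_review_done": {"high": 0.9, "low": 0.4},
    "operator_trained": {"high": 0.7, "low": 0.5},
}


def posterior(evidence):
    """Bayes update that skips characteristics whose value is None (missing)."""
    weights = dict(PRIOR)
    for name, value in evidence.items():
        if value is None:
            continue
        for cls in weights:
            p_true = LIKELIHOOD[name][cls]
            weights[cls] *= p_true if value else 1.0 - p_true
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}


def entropy(dist):
    """Shannon entropy in bits, used here as a measure of remaining uncertainty."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_entropy_after(name, evidence):
    """Expected uncertainty if the missing characteristic `name` were observed."""
    current = posterior(evidence)
    result = 0.0
    for value in (True, False):
        p_value = sum(
            current[cls] * (LIKELIHOOD[name][cls] if value else 1.0 - LIKELIHOOD[name][cls])
            for cls in current
        )
        result += p_value * entropy(posterior({**evidence, name: value}))
    return result


nothing_known = {"formal_review_done": None, "operator_trained": None}
print(posterior({"formal_review_done": True, "operator_trained": None}))
for name in LIKELIHOOD:
    print(name, round(expected_entropy_after(name, nothing_known), 3))
```

The characteristic with the lower expected entropy is the more valuable one to procure next. Under these invented numbers that is `formal_review_done`, because its likelihoods differ most between the two classes.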

3.3 Combination of Categorical Analysis and Bayes’ Reasoning

In short, the techniques mentioned above will be combined in PESANTE as
follows.

Procurement of information on PES functioning will be done using readily
available techniques such as the Delphi method and other expert opinion
collection techniques.

Categorical analysis will be used to elicit relations between heterogeneous data
and dependability aspects.

After elicitation of the relations, a model based on Bayes’ reasoning will be tuned
accordingly. The resulting tool will assess future PESs.

4 Conclusions

The  PESANTE  project  will  deliver  an  integral  approach  to  dependability 
assessment  of  Programmable  Electronic  Systems.  Valuable  are  the  following 
characteristics: 

-  It  will  be  able  to  combine  heterogeneous  data  on  the  PES  under  consideration 
in  order  to  arrive  at  a  sound  assessment. 

-  It  will  be  able  to  determine  the  value  of  data:  what  data  is  most  valuable  to 
arrive  at  a  better  assessment.  This  makes  sure  that  a  balanced  assessment  is  being 
made. 

-  It  will  be  able  to  cope  with  missing  data.  In  most  cases,  data  on  a  PES  are 
incomplete.  Most  methods  demand  input,  available  or  not.  PESANTE  will  base 
the assessment on the data available.

Abstract

This paper firstly analyses the decision making of a supplier in
applying for certification. The process of safety assessment as seen
from the supplier side is described, in particular the experience of
being assessed by TUV to Safety Requirements Class 5/6 of DIN V
VDE 0801.

By describing in detail the various phases of the assessment, it is
hoped that this paper will help other would-be applicants by
easing their assessment path. The paper then reviews other
national and international standards and schemes and looks forward
to the possibility of a unified ISO standard and assessment procedure.

In conclusion the paper briefly analyses the cost/benefit of assessment
and certification.


1.  Introduction 

As a supplier of safety critical systems to both the Nuclear and Petrochemical
industry, August Systems began in early 1990 to review the case for independent
safety certification. From this review it was decided to proceed with safety
certification for a Triple Modular Redundant Software Implemented Fault Tolerant
Safety System. This paper analyses the reasons for this decision and seeks to
provide aid and guidance to other suppliers on the path of safety certification.

The  path  of  independent  safety  certification  is  a  costly  and  time  consuming  process 
and  no  vendor  should  set  out  on  this  path  without  the  resolve  and  financial 
resources  to  see  the  process  through  to  the  end. 


2.  Assessor  and  Certification  Choice 

Worldwide there are a small number of authorities that will provide safety
certification to various national and application directed standards. Currently there
are no international standards for safety systems to be assessed to, and therefore the
first decision that any vendor must make relates to the choice of assessor and the
choice of standard to be assessed against.

The market to which the relevant safety system is targeted will often influence the
choice of assessor and standard. For nuclear installations each major country has its
own standards and regulatory organisation, e.g. the Nuclear Installations Inspectorate
(NII) in the UK. This reliance on application and national standards limits the
commercial viability of a more generalised safety systems approval.

Other safety-conscious industrial markets such as Petrochemicals rely on standards
and codes of practice such as the PES 1 and 2 guidelines issued by the Health and
Safety Executive (HSE) of the UK and the Engineering Equipment and Materials
Users Association (EEMUA) guidelines. More recently, however, TUV Certification
to Class 5 and 6 has been considered appropriate for programmable systems
by a number of major operators in this field.

Industry specific standards and certifying authorities such as the Federal Aviation
Administration (FAA) in the USA, and the Civil Aviation Authority (CAA) in the UK,
have obtained international acceptance on a wide scale for flight safety. To a lesser
extent the national certification authorities for railway signalling have also achieved
some international recognition.

As can be seen, the choice is wide and is normally governed by a combination of
both the targeted industry and the national authority. In the international market of
petrochemicals August Systems chose TUV as the certifying authority and DIN V
VDE 0801 as the standard for safety systems. The choice was governed by a
combination of major customer acceptance and international acceptability. The class
of certification 5/6 was chosen to match the market for the majority of safety
systems for the petrochemical market. Higher classification systems such as 7/8
may be readily obtained by a combination of diverse systems in highly critical plant
areas (the references provide more information on High Integrity Protection Systems
(HIPS), where class 7/8 safety is required).

The DIN V VDE 0801 specification provides a risk graph as an aid to the user to
determine the class of safety for the application; this is shown in Figure 1.

3. Assessment to DIN V VDE 0801


The assessment covers three basic phases, which are:

a)  The  Design  Analysis 

b)  The  Hardware  and  Software  Inspection  and  Testing 

c)  The  Certification 

Each of the three phases has its own prerequisites and imposes different work loads
on both the vendor and the assessor.


3.1  The  Design  Analysis 


The design analysis phase for DIN V VDE 0801 is called the concept review and
consists of three distinct parts: requirements class selection, documentation
inspection and analysis of the safety concepts of the system.

Figure 1: Risk graph, requirement classes 1-8


3.1.1  Requirement  Class  Selection 


By holding informal reviews with the vendor, basically enabling the TUV staff to
understand the general concepts of the system that is being offered for certification,
the suitable certification class is selected and agreed. From this point onwards all
measurements and results will be interpreted in relation to the selected requirements
class. The levels of the efficiency of the measures taken for safety and the types
of failure caused are simply defined by Table 1, which is given in DIN V
VDE 0801.

Table 1


3.1.2  Documentation  Inspection 

For  a  system  to  be  capable  of  being  accredited  with  a  safety  classification  it  must 
be  thoroughly  and  accurately  documented.  The  certifying  authority  uses  this  phase 
to  both  validate  the  documentation  and  to  give  a  further  insight  into  the  operation 
of  the  submitted  system. 

For  the  vendor  this  period  will  almost  certainly  result  in  documentation  corrections 
being  instigated  and  if  the  system  is  not  thoroughly  documented,  additional 
documents  will  need  to  be  produced  to  provide  a  complete  documentation  set. 

The  results  of  the  section  will  confirm  or  otherwise  that  the  system  is  capable  of 
proceeding  to  the  next  phase  and  being  accredited  to  the  appropriate  class. 


3.1.3 Analysis of Safety Concepts


The vendor will be required to submit a Safety Concepts Review Document that
completely describes all of the safety aspects and concepts of the submitted system.
This document will almost certainly not exist within the vendor organisation;
however, the information to produce this document would normally be readily
available from the vendor’s standard documentation packages.

This  document  plus  all  of  the  standard  software  and  hardware  documentation  will 
allow  the  certifying  authority  to  produce  a  system  level  failure  mode  effects  analysis 
and  will  also  provide  an  indication  of  the  effectiveness  of  the  measures  taken  to 
prevent  system  failure  or  fail  to  danger. 

When the system has successfully passed through this first phase, the in-depth testing,
inspection and validation can proceed.

3.2 Hardware and Software Inspection and Testing

This phase of the certification is probably the most intense for the certifying
authority. It consists of Hardware Inspection and Validation, Software Inspection
and Validation, Systems Integration and Safety Concepts Validation.

3.2.1  Hardware  Inspection 

The  certifying  authority  now  completes  a  detailed  inspection  of  every  hardware 
assembly.  This  inspection  includes  track  thickness  and  track  gaps  on  printed  circuit 
boards  as  well  as  a  complete  analysis  from  documentation  to  final  product. 

Detailed failure mode effects analyses are carried out on all circuits where safety or
fault tolerance is seen to be critical; this analysis encompasses all critical
component tolerance analysis using CAE tools.

The Mean Time Between Failure calculations provided by the vendor are checked
and finally the system is type tested to the vendor specification. This final exercise
is normally completed shortly before certification, as by experience TUV have found
that it is better to ensure full certification is achievable before type testing takes
place.


3.2.2  Software  Inspection 

It is at this stage that any shortcomings in the design approach of the software to
be accredited will become evident. For the rigorous designers, there will be little
to do but perhaps provide additional structure guide documentation.


If  however  the  source  code  and  structures  are  not  well  documented  then  there  will 
be  a  significant  amount  of  back  documentation  to  be  completed  before  the  assessing 
authority  can  proceed.  It  must  be  remembered  that  the  assessors  will  have  little  or 
no  background  knowledge  about  the  design  to  be  assessed. 

The process of assessment includes a thorough inspection of the documentation,
data structures and data flows against quality criteria, programming codes of
practice and consistency against the specification.

The software is analysed for measures taken to guard against faults and errors. Such
techniques as defensive programming, on-line software test routines and error traps
are confirmed to be present or otherwise, and this enables TUV to determine the
category that the software design belongs to.

The  software  is  also  subjected  to  a  computer  aided  white-box  test  which  provides 
analysis  of  coverage,  run-time,  data  range  and  control  flow. 

3.2.3  Integration  and  System  Test 

By this time in the evaluation the authority will have an intimate knowledge
of the detailed workings of the hardware and software system under review. This
will have been obtained not just by review of the thorough documentation but also
by a series of meetings held with the vendor’s engineers. This acquired knowledge
plus the experience of the examiners enables the certification authority to complete
a large number of fault injection tests on both hardware and software.

The purpose of fault injection is to confirm the theory of operation for fault
detection, time of fault detection, fault tolerance, fail safety and reparability.
Faults are injected, including simulated hardware failures, actual software
corruption and simulated fail-to-danger scenarios, and each result is logged and
analysed.


4.  Application  Specific  Criteria  and  Final  Report 

With the results of the test complete it is now possible for the evaluating authority
to generate a final report which defines the process types that can be protected (e.g.
processes that have fail-safe states), the main time that faults may be in the system
undetected and any special requirements of configuration.

5. Accreditation Standards

Currently there are no internationally accepted standards to which accreditation can
be achieved. In the author’s opinion it will be several years before a fully accepted
international standard is available. Two national groups and one international group
have been working towards producing an acceptable standard and the following
provides a review of the current status of those standards.

5.1 Instrument Society of America ISA SP84

The ISA SP84 group is a process industry group that has been working on
generating standards for safety control applications in the process industries,
particularly for programmable systems. The committee consists primarily of users
and a few selected vendors and the standards are in an advanced state of production,
with perhaps the exception of software criteria. Because of this advanced status it
is possible that these standards will be submitted for international recognition in the
coming two years; they are, however, currently limited to process applications.

5.2  DIN  V  VDE  0801 

This specification is a German national standard which is not industry specific.
The standard has been evolving over the last decade and its current status gives it
acceptance both in Germany and by certain international companies. I would expect
the German standards groups to push for DIN V VDE 0801 to be accepted as the
international standard for safety accreditation. The standard certainly does provide
the basic coverage that an international standard will need; however, in the author's
opinion it does require further development, specifically in the area of safety
metrics.

5.3 IEC 65A Working Groups 9 and 10

These European/International working groups have been operating for a number of
years in an attempt to produce an acceptable ISO standard. The two working
groups have been contributing to different aspects of safety systems: WG10 provides
a generic system approach and WG9 concentrates on safety software.

It is hoped by the author that these standards will eventually become accepted by
ISO, although my expectation is that at least 2-3 years of further work is required
and probably a degree of amalgamation with both the SP84 and the VDE 0801
standards.

6. Conclusion

With  no  one  international  standard  to  accredit  against,  the  drive  to  invest  in 
accreditation  by  a  safety  system  supplier  will  often  be  governed  by  his  customers. 
If  a  sufficient  number  of  customers  continue  to  push  for  accreditation  by  a 
recognised  authority  then  the  supplier  will  need  to  make  a  commercial  decision  as 
to  whether  the  investment  in  accreditation  is  necessary. 

The  cost  of  accreditation  is  of  course  dependent  on  the  size  and  complexity  of  the 
equipment  submitted,  but  will  almost  certainly  fall  within  the  range  of  $50,000  - 
$250,000. 

Additionally  there  is  the  ongoing  cost  to  ensure  that  modifications  and  updates  are 
re-accredited. 

From the supplier's viewpoint one internationally accepted standard with
accreditation authorities in all of the major industrial countries would be the
preferred outcome. Until this is achieved we will have to continue to make
commercial decisions on when and where accreditation is achieved.

1 Introduction

There is an increasing use of computing in safety related applications and ensuring
that such systems are conceived, designed and produced with appropriate attention to
safety is not easy. The process of identifying undesirable events and their
consequences is known as hazard analysis. Carrying out a hazard analysis at varying
stages during the development process is now being mandated in emerging standards
produced by the International Electrotechnical Commission (IEC) [1] and the U.K.
Ministry of Defence [2].

With computer based systems there is the particular problem that design faults
tend to dominate random hardware faults and so identifying potential hazards early in
the design process is crucial. There are well established methods for carrying out
hazard analysis in many domains of engineering such as petro-chemical but it is not
clear that such methods are readily applicable to computing based systems. Further,
there is little advice available to assist with hazard analyses using computer
technology.

To address the above problem we have modified the Hazard And OPerability
(HAZOP) technique [3] and used it successfully in the computing domain. Our
evolution of the HAZOP technique has taken place over a number of years and most
applications have been Commercial in Confidence. However, a medical imaging
application is being studied as a collaborative research project and serves as an
appropriate case study.

The remainder of this paper explains our conclusion that a modified HAZOP
is an effective way to carry out hazard analysis on computer based systems. We find
that:

• The HAZOP method is well established and fits well into overall safety management

• The method can be extended to the computer software field

• The experience to date is that the approach is powerful and we illustrate this
with the medical case study.


1. The case study reported here is part of a collaborative project between The Centre for
Software Engineering Limited, Cambridge Consultants Limited and The Human Genetics
Unit of The Medical Research Council.

To put the results in context the next section gives a description of the experimental
medical system. Subsequent sections deal with our efforts to develop a modified
HAZOP and a case study of the results obtained through its application to the
experimental medical system.

2  The  Medical  Diagnostics  Application 

There are many laboratories worldwide which carry out cervical screening by expert
manual inspection of slides prepared from smear samples. The purpose of screening
is to recognise the presence of abnormal cells in a sample which may contain few
such cells, and is likely to contain possibly confusing other matter. Usually, the
review is manual and includes an assessment of the degree of abnormality, leading to
diagnosis and treatment. For screening of healthy patients on a regular basis, there
are likely to be few abnormal cells but for diagnostic screening of a sick patient
abnormal cells will be expected.

The U.K. Medical Research Council Human Genetics Unit in Edinburgh
(HGU) has developed over a number of years a semi-automated screening system for
cervical smears [4]. The basis of the automated system is the computer analysis of
a slide prepared from the smear sample. The first version of the system relied on
custom hardware to capture and process images and the HGU are now in the process
of re-engineering the system to run on a modern computing platform without the
use of custom hardware. The collaborative project of which the work reported here
is part involves the re-implementation of part of the new system currently under
development.

The system relies on efficient image processing and classification algorithms
to find the abnormal cells. An initial search at high speed is carried out at low
resolution to identify suspicious objects (an object is something that appears to the
software to have cell-like characteristics but may not in fact be a cell). A second
pass is then done at higher resolution so that a more detailed analysis can filter and
rank the objects. The top-ranking objects are classified and a decision made to class
them as normal, ask for a skilled human review, or pass for a full conventional
review.
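
The two-pass search and the three-way decision described above can be sketched as follows. This is a minimal illustration rather than the HGU implementation: the numeric scores, the cut-off values and every name in the code are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NORMAL = "class as normal"
    SKILLED_REVIEW = "ask for a skilled human review"
    FULL_REVIEW = "pass for a full conventional review"


@dataclass
class CandidateObject:
    object_id: int
    low_res_score: float   # hypothetical suspicion score from the first pass
    high_res_score: float  # hypothetical score from the second, detailed pass


def screen_slide(objects, low_res_cutoff=0.3, top_n=3,
                 review_cutoff=0.5, full_cutoff=0.8):
    """Return the decision for one slide (all thresholds are assumptions)."""
    # Pass 1: fast low-resolution search keeps only suspicious objects.
    suspicious = [o for o in objects if o.low_res_score >= low_res_cutoff]
    # Pass 2: higher-resolution analysis filters and ranks the survivors.
    ranked = sorted(suspicious, key=lambda o: o.high_res_score, reverse=True)
    top = ranked[:top_n]
    # Decision based on the most suspicious of the top-ranking objects.
    worst = max((o.high_res_score for o in top), default=0.0)
    if worst >= full_cutoff:
        return Outcome.FULL_REVIEW
    if worst >= review_cutoff:
        return Outcome.SKILLED_REVIEW
    return Outcome.NORMAL


if __name__ == "__main__":
    slide = [CandidateObject(1, 0.9, 0.85), CandidateObject(2, 0.4, 0.2)]
    print(screen_slide(slide).value)
```

Note that in this sketch an object rejected by the first pass never reaches the detailed analysis, whatever its true nature.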

Outside the core imaging system, there are various preparatory and post-analysis
processes to be taken into consideration. The former include administrative
tasks to handle samples and accompanying paperwork sent from General
Practitioners and clinical agencies. This is then followed by slide preparation and
the transfer of slides to the automated imaging system. Configuration may then be
necessary to initialise certain batch processing control parameters before the
automated process itself can be started. Once this has been completed, there follows
a certain amount of tidying-up before results are passed for diagnostic review,
signing out or some form of quality control check. Results of the analysis must
finally be returned to the clinic which supplied the sample and the analyzed slide
placed in a long term archive.

The core image processing part of this application forms the case study for
our application of HAZOP to a software based system.

3 HAZOP Studies are an Effective Part of Safety Management

This section describes the process of ensuring that safety is considered during all
phases of the life of a system, the safety lifecycle. It then gives an overview of one
of the established methods for identifying hazards, the HAZOP, followed by some of
the advantages of the method.

3.1  The  Safety  Lifecycle 

In a number of established industries, particularly petrochemical, the possibility of
failures having an adverse impact on safety has been recognised for many years.
However, it was not until a number of accidents and near accidents had taken place
that it was recognised that a systematic approach to the management of safety was
required. The approach taken is independent of the industry. To clarify the
following, two definitions are given, both taken from Ref [1]:

A hazard is a physical situation with a potential for human injury.

Risk is the combination of the frequency, or probability, and the consequence of a
specified hazardous event.

The basic steps in the overall safety lifecycle are:

• System definition, generating an overall description of the system under review

• Hazard analysis to identify the potential for hazardous events

• Risk analysis to judge the safety risk of the defined system. This quantifies
the potential for hazardous events and evaluates their consequences.

• Judgement of the acceptability of the risk

• Activities, if necessary, to reduce the risk to an acceptable level. This might
be by modifying the architecture of the system, including extra measures to
avoid or contain safety failures, or ensuring that the system is built to
standards that are appropriate to the level of risk.

• Implementing the system to the required standards, followed by effective
operation and maintenance
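
The risk analysis and acceptability steps in the list above can be illustrated with a small sketch. The definition quoted from Ref [1] says only that risk is a combination of frequency and consequence; taking the product of the two, as done here, is one common choice and is an assumption of this sketch, as are the units, the threshold and all names.

```python
from dataclasses import dataclass


@dataclass
class HazardousEvent:
    name: str
    frequency_per_year: float  # how often the event is expected to occur
    consequence: float         # hypothetical severity units per occurrence


def risk(event: HazardousEvent) -> float:
    """Combine frequency and consequence (assumed here to be a product)."""
    return event.frequency_per_year * event.consequence


def judge(events, acceptable_risk: float) -> dict:
    """Judge each event's risk against a hypothetical acceptability limit."""
    return {
        e.name: "acceptable" if risk(e) <= acceptable_risk else "reduce risk"
        for e in events
    }


if __name__ == "__main__":
    events = [
        HazardousEvent("abnormal cell overlooked", 0.5, 4.0),
        HazardousEvent("normal cell flagged as abnormal", 0.25, 2.0),
    ]
    print(judge(events, acceptable_risk=1.0))
```

Events judged "reduce risk" would then feed the risk reduction activities in the lifecycle; the figures in the usage example are invented purely to exercise the code.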

In recent years there has been a huge growth in the use of computers and, as we are
all aware, all computer systems are liable to contain design mistakes. The fact of
the fallibility of computer systems when used in safety critical situations is now
being addressed formally in standards work. Two new standards have been drafted:
one from the International Electrotechnical Commission addresses functional safety
of computer based systems [1] and the other, from the U.K. Ministry of Defence,
addresses hazard analysis for computer based systems [2].

Both the standards use the above safety lifecycle and stress the importance of
carrying out hazard analysis. The traditional industries have developed a number of
structured methods to help ensure that the hazard analysis is complete and thorough.
Such methods include 'What if' analysis, failure modes and effects analysis (FMEA),
and HAZOP. The next section introduces one of the most effective hazard analysis
methods, HAZOP.

3.2 The Traditional HAZOP

The full name of HAZOP is Hazard and Operability, and this gives pointers to the
two facets of its purpose. The HAZOP ensures both that features that could lead to
undesirable outcomes (i.e. hazards) are avoided and that necessary features are
incorporated into the design for safe operation. The method was developed in the
U.K. by ICI in the late 1960s and is well established in the petro-chemical sector.
A reference for its use in that industry is given in [3].

In these industries, the plant design is normally described by piping and
instrumentation diagrams (P&IDs). The HAZOP study is carried out by a team of
knowledgeable engineers who carry out a systematic examination of the design.
They postulate, for each element of the system design in the P&IDs, deviations
from the normal operating mode and then assess the consequences of those
deviations with respect to any safety or operability problems.

For each deviation, the team asks 'can it happen?' and, if it can, is this likely
to lead to a hazard. The team will take into consideration any mitigating features,
such as control valves, alarms etc. which might control the hazard. To formalise
the process, a series of guidewords are used to define particular deviations and these
are applied to each relevant parameter for each process component.

Thus, for fluid flowing in a pipe, relevant parameters might be flow,
pressure, temperature and deviations examined would include high, low, no,
reverse, as well as. In theory, each guideword should be applied to each relevant
parameter for each part of the process description. In practice, this is very
time-consuming, and an experienced HAZOP leader will use judgement to control the
correct detail of questioning in each area.
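
The "in theory" exhaustive application of guidewords can be sketched as a simple enumeration. The parameters and guidewords are those named in the text; the component names and the `skip` mechanism (standing in for the leader's judgement about which combinations to leave out) are assumptions of this illustration.

```python
from itertools import product

# Parameters and guidewords taken from the pipe example in the text.
PARAMETERS = ["flow", "pressure", "temperature"]
GUIDEWORDS = ["high", "low", "no", "reverse", "as well as"]


def candidate_deviations(components, parameters=PARAMETERS,
                         guidewords=GUIDEWORDS, skip=frozenset()):
    """Yield every (component, parameter, guideword) deviation to question.

    `skip` holds (parameter, guideword) pairs judged not worth discussing;
    it is a hypothetical stand-in for the HAZOP leader's judgement.
    """
    for component, parameter, guideword in product(components, parameters,
                                                   guidewords):
        if (parameter, guideword) in skip:
            continue
        yield component, parameter, guideword


if __name__ == "__main__":
    pipes = ["pipe P1", "pipe P2"]  # hypothetical plant components
    everything = list(candidate_deviations(pipes))
    pruned = list(candidate_deviations(pipes,
                                       skip={("temperature", "reverse")}))
    print(len(everything), "deviations in theory,", len(pruned), "after pruning")
```

Even this toy case with two components generates thirty questions, which shows why exhaustive application quickly becomes time-consuming on a real design.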

The results can be presented in a number of ways and we have found a
software package developed by our parent company, Arthur D Little, running on a
personal computer to be effective [5]. Results are presented under the following
headings:

Item number - a simple count of items logged from the beginning of the HAZOP

Equipment item - a description of the area of plant for which a deviation has
been found

Parameter - such as temperature, pressure, flow etc.

Guideword for Deviation - such as high, low etc.

Cause - the circumstances that could give rise to the deviation

Consequence - the effect on the plant that the deviation might lead to

Indication/Protection - any feature that will either identify the deviation or
mitigate its effects (e.g. an alarm signal or a pressure relief valve).

Question/Recommendation - questions arise from items considered a potential
hazard which cannot be resolved by the meeting. Recommendations are generally
for changes to the design or particular actions to be taken during operation.

Answers/Comments - This allows the later insertion of answers to questions
raised or notes which the team consider relevant to the design but are not questions
or recommendations.
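
The headings above map naturally onto a record structure. The following sketch is not the Arthur D Little package; the class names, the method names and the example entry are hypothetical and serve only to show how such a log might be held.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class HazopItem:
    """One logged item, with one field per heading in the text."""
    item_number: int
    equipment_item: str
    parameter: str
    guideword: str
    cause: str
    consequence: str
    indication_protection: str = ""
    question_recommendation: str = ""
    answers_comments: str = ""  # filled in after the meeting, if at all


class HazopLog:
    """Hypothetical container that numbers items from the start of the HAZOP."""

    def __init__(self):
        self._counter = count(1)
        self.items = []

    def record(self, **fields) -> HazopItem:
        item = HazopItem(item_number=next(self._counter), **fields)
        self.items.append(item)
        return item

    def open_questions(self):
        """Items with a question or recommendation that has no answer yet."""
        return [i for i in self.items
                if i.question_recommendation and not i.answers_comments]


if __name__ == "__main__":
    log = HazopLog()
    log.record(equipment_item="feed pipe", parameter="pressure",
               guideword="high", cause="blocked outlet",
               consequence="pipe rupture",
               indication_protection="pressure relief valve",
               question_recommendation="Is the relief valve sized for this case?")
    print(len(log.open_questions()), "open question(s)")
```

Keeping the answers field separate from the question field makes it easy to list the items still awaiting resolution after the meeting.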

The HAZOP team is normally small, four to eight people, comprising a leader, a
team secretary, people who understand the design intent, people who are experienced
in the operation of similar plant and specific technical experts as necessary.

3.3 Advantages of HAZOP

There are three main advantages of the HAZOP approach to doing hazard analysis: it
is carried out by a team; it deals with the design in a systematic top-down manner;
and it is capable of being applied during all phases of a system lifecycle.

The team approach allows a variety of expertise and viewpoints to be applied
to the system. Our experience is that many problems are caused by interactions
between parts of the system and by differing understandings of designers and
operators. The team approach allows such issues to be explored and there is less
impact from a mistake by one team member. Having key personnel on the team
means that any problem areas are brought immediately to their attention.

Dealing with the system in a top-down manner allows concentration on the
key issues arising from the potential hazards and allows system-wide implications to
be assessed. Having a clear view of the whole system allows the team to explore
non-obvious interactions between parts of the system. This is in contrast with
bottom-up approaches such as FMEA which must analyse every component to a
similar level of detail which is time-consuming and error prone. We have found that
the HAZOP is useful in identifying which are the most critical areas to concentrate
on in any later FMEA.

The approach may be used during all phases of the life of a system. We have
used it successfully both at the conceptual design stage and for safety assessment on
completed systems.

4  HAZOP  Has  Been  Modified 

The positive experience of Arthur D Little in using HAZOP made it their technique
of choice for hazard analysis in the petro-chemical industry. In recent years they
have worked with their subsidiary company, Cambridge Consultants, to explore
whether the HAZOP method could be extended to other domains, particularly those
where electronics and computers were involved. It became clear that work was
needed in two different areas: first to find appropriate representations and, second, to
see what was an appropriate set of parameters and deviation guidewords.

4.1  An  Effective  Representation  Method  is  Needed 

With computer based systems there is not, obviously, a P&ID to work from. For
electronic designs there are clear parallels with block diagrams and circuit diagrams
and there is much uniformity of representation. Transfer of the approach was thus
straightforward. This was not the case with software. There are many
representational methods which involve diagrams and some, using mathematics,
which have no pictorial representation. We have found that the HAZOP approach is
tractable with a variety of pictorial methods but that the dataflow representation of
structured design is the most natural to work with.

With our exemplar diagnostic image processing system, little in the way of
design documentation was available and so we worked with the system designers to
build a dataflow representation of the system using a CASE tool called Software
through Pictures [6]. Figure 1 shows the first level of decomposition from the top
level context diagram.


Figure 1: Cervical Screening System

The asterisk in process 2 indicates that a lower level decomposition has been
generated. The three processes show the activities of:

• Slide preparation - preparing slides from smear samples

• Slide screening - the computer based imaging and diagnostic process

• Full review - manual expert checking of slides recognised as
suspicious by the computer system

Figure 2 shows a simplified version of part of a lower level process, the
process of carrying out a scan of the slide at low resolution. One of the sub
processes, Bright Field, is used below to illustrate the HAZOP method so its
function will be described here. Four or five frames are captured by using a
microscope, digital camera and a special framestore supporting direct memory access
of reduced scale images (typically every 5th pixel). The process performs a logical
OR operation on the captured frames to determine maximum light intensity on a
pixel-by-pixel basis. This data can then be used to correct object images for optical
density (that is, making them invariant to lamp brightness and variation in field ...).


Figure 2: Part of Low Resolution Search Scan
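
The Bright Field function can be sketched in a few lines. The paper gives no formulae, so two assumptions are made here: the combination of frames is interpreted as a per-pixel maximum, and optical density is taken in its usual sense of log10(bright field intensity / observed intensity). Images are plain nested lists of grey levels and all names are hypothetical.

```python
import math


def bright_field(frames):
    """Per-pixel maximum intensity over the captured frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[max(frame[r][c] for frame in frames) for c in range(cols)]
            for r in range(rows)]


def optical_density(image, bright):
    """Correct an object image against the bright field reference.

    Uses OD = log10(bright / pixel); pixel values are clamped to at least 1
    to avoid division by zero (an assumption suited to integer grey levels).
    """
    return [[math.log10(b / max(p, 1)) for p, b in zip(img_row, bright_row)]
            for img_row, bright_row in zip(image, bright)]


if __name__ == "__main__":
    frames = [[[100, 200], [50, 10]],
              [[120, 180], [50, 20]]]
    reference = bright_field(frames)
    print(reference)
    print(optical_density([[12, 200], [5, 20]], reference))
```

Because the correction is a ratio, doubling the lamp intensity doubles both the image and the reference and leaves the optical density unchanged, which is the invariance to lamp brightness the text describes. It also shows why an incorrect reference value corrupts every image corrected against it.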


We found that the use of the dataflow paradigm and a CASE tool enabled an accurate
and easy to understand representation of the design to be produced quickly. One of
the main advantages of the dataflow approach we have found over a number of
projects is that the design is readily understandable by all interested parties, even
those without any computing background.


4.2 New Guidewords Have Been Derived


Since the software and image processing domains differ significantly from
petro-chemical engineering, we found that the standard HAZOP guidewords did not provide
a rich enough set. Over a number of projects [7] we have evolved a set which is
generally applicable to software based systems and examples of these are shown in
Table 1 below. The full vocabulary of guidewords was larger and included words
that were only used on a few occasions. These arose through the complex mix of
... in the imaging system ...
deviation. The set is evolving as our experience increases and we believe that

Table 1: Parameters and their Deviation Guidewords


5  The  HAZOP  Process  Worked  Well  in  Practice 

5.1  Results 

The HAZOP was carried out by a team of four people including an experienced
leader and one of the designers of the re-engineered diagnostic system. The HAZOP
was of the full computer based image processing system and some 100 items were
noted during the work.

The HAZOP was applied in a top-down manner so that each process was
taken in the numeric order in which it occurred in the dataflow decomposition: first,
deviations of the input dataflows were explored, followed by examination of the
process itself which normally defined the possible deviations and their consequences
for the process outputs. Where data items were used as inputs to more than one
process, they of course only needed to be addressed in detail once during the
HAZOP. Thus, as each new level was examined, the data passed down from a
higher level was marked and only reviewed briefly to check that possible corruptions
or transformations had not been overlooked. Some generic conclusions from the
results will be given and then some of the specific items from the process Capture
Bright Field described above.
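
The ordering just described can be sketched as a simple agenda builder. This is an illustration under stated assumptions, not the procedure used in the study: process numbers are assumed to be dotted strings such as "2.1", higher levels are visited before lower ones, and the process and dataflow names are hypothetical.

```python
def numeric_key(number: str):
    """Sort key: higher (shallower) levels first, then numeric order."""
    parts = tuple(int(p) for p in number.split("."))
    return (len(parts), parts)


def build_agenda(processes: dict) -> list:
    """Return (process, subject, depth of review) tuples in HAZOP order.

    `processes` maps a process number to a dict with an "inputs" list.
    A dataflow is examined in detail the first time it is met and only
    briefly when it is passed down or reused later.
    """
    examined = set()
    agenda = []
    for number in sorted(processes, key=numeric_key):
        for flow in processes[number]["inputs"]:
            depth = "brief" if flow in examined else "detailed"
            agenda.append((number, flow, depth))
            examined.add(flow)
        # After the inputs, the process itself is examined.
        agenda.append((number, "process itself", "detailed"))
    return agenda


if __name__ == "__main__":
    model = {
        "2": {"inputs": ["slide"]},
        "1": {"inputs": ["smear sample"]},
        "2.1": {"inputs": ["slide", "scan parameters"]},
    }
    for entry in build_agenda(model):
        print(entry)
```

In the usage example the dataflow "slide" is examined in detail at process 2 and only briefly when it reappears as an input to process 2.1.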

The generic conclusions may be divided into three themes: First, efficiency-led
gains in processing speed have to be balanced against the potential for hazards
due to incomplete sampling. To achieve the benefits of automatic screening the
system throughput must be high and the necessary use of complex image processing
algorithms and a series of filters means that not all of a sample may be scanned.

Second, incorrect recognition of abnormal cells can arise from hardware faults and
current software implementation practice. Hazards may arise through data corruption
either leading to overlooking of abnormal cells or through recognition of normal
cells as abnormal. The experimental approach to date has concentrated on algorithm
research using the most convenient tools and methods and has given less attention
to achieving high levels of safety integrity.

Thirdly, key data items for set-up and operation must maintain their integrity.
In particular, processing histories are kept for a number of days on a rolling basis
and classifier learning data is used by algorithms: these and similar data must
maintain their integrity so that consistent system operation and correct tracking of
patient details are ensured.

The process Bright Field shown in Fig 2 is used to illustrate the HAZOP.

Recommendation 59 Bright field: high cell density could lead to incorrect OD
calculation. The effect could be a 'blind spot'. Carry out sensibility check.

Question 60 Bright field: drift of lamp intensity could lead to missed objects.
How can this be checked for?

Question 61 Bright field: depends on light maxima which could be too high if
slide sample acts as lens, forming ring on image. Can other algorithms be used?

Question 62 Bright field: if value too high, too low or randomly incorrect,
subsequent OD corrected images will be corrupted. Can this be checked for?

It  should  be  noted  that  this  was  a  HAZOP  of  an  experimental  system, 
developed  to  show  feasibility  and  it  is  not  surprising  that  we  found  the  system 
lacking  in  a  number  of  areas.  The  value  for  the  design  team  is  that  they  believe 
team  members  are  now  well  placed  to  address  hazard  concerns  in  order  to  mitigate 
areas  of  potential  risk. 

6  The  Approach  Has  Advantages  and  Limitations 

The advantages of using a HAZOP described in Section 3.3 above have been found
to be independent of the application domain and of the technologies involved. For
the computing domain described in this paper the application of the method was
novel to the HGU personnel involved. The combination of the dataflow
representation and the systematic approach to examining deviations that could lead
to hazards worked effectively. Once the team became used to the method (our
experience on this and other projects is that familiarisation takes about half a day)
productivity was high and the full HAZOP of the system took only a few days.
This high productivity and early focusing on important issues is, we believe, a
major advantage of the HAZOP as contrasted with bottom-up approaches.

There are, however, limitations to the approach. In the process industries
guidewords for deviations from design intent are well established and application is
reasonably straightforward. With software based systems we have found that
application is less easy:

• The expertise of the HAZOP leader is crucial in focusing the discussion on
potential hazards and not on exploring 'interesting' areas.

• Slavish adherence to the guidewords is not sufficient. We have found that
significant flexibility and multi-disciplinary design experience by the team is
necessary to explore unusual interactions.

• Independent technical experts, experienced in safety critical computer system
design, are necessary to gain full benefit.

Thus we have found that the HAZOP approach provides an effective way to
carry out a systematic and cost-effective hazard analysis but that the use of technical
expertise and design experience is still required. We postulate that this is because of
the inherent complexity of software based systems and their propensity for design
error.

References 

1 Functional Safety of Electrical/Electronic/Programmable Systems: Generic
Aspects. IEC 65A (Secretariat) 123, 1991.

2 Interim Def Stan 00-56, Hazard Analysis and Safety Classification of the
Computer and Programmable Electronic System Elements of Defence
Equipment. UK Ministry of Defence, 1991.

3 A Guide to Hazard and Operability Studies. Chemical Industries Association
Limited, 1987.

4 Husain O, Watts K, Lorriman F et al. Semi-Automated Cervical Smear
pre-screening system: an evaluation of the Cytoscan-110. Analytical and Cellular
Pathology, 5: 49-68, 1993.

5 HAZOPtimizer. Arthur D Little, Safety and Risk Management, Cambridge,
Massachusetts.

6 Software through Pictures. Interactive Development Environments,
California.

7 Chudleigh M, Catmur J. Safety Assessment of Computer Systems using
HAZOP and Audit Techniques. In: Frey (ed) Safety of Computer Control
Systems 1992 (Safecomp '92), pp 285-292.


Session  4 


SAFETY  ANALYSIS 

Chair: F. Koornneef
Delft  University  of  Technology,  NL 


Safety  Analysis  of  Clinical  Laboratory  Systems 

Authors 

S.S.  Dhanjal  -  Lloyd’s  Register 
R.  Fink  -  West  Middlesex  University  Hospital 


1.  Introduction 

The Clinical Biochemistry Department (CBD) at the West Middlesex University
Hospital (WMH) performs tests on constituents of body fluids to facilitate diagnosis,
prognosis and monitoring of treatment. A rapid increase in the need for this service
in recent years has led to extensive automation of the analysis and data handling
operations within the department. The automation and data handling have been
implemented by integrating a computerised Laboratory Information Management
System (LIMS) into the operations of the Laboratory.

Quality assurance of the analytical processes is well established. However, in
common with other safety related disciplines and applications, there is a concern
about the reliability of the computerised data management system and the lack of
generally accepted standards. These issues have been addressed as part of the DTI
sponsored MORSE (Methods for Object Reuse in Safety Critical Environments)
project.

WMH is a member of the MORSE project consortium (Dowty Controls, Lloyd's
Register, Transmitton Ltd, West Middlesex University Hospital and the University
of Cambridge). The MORSE project has been inspired by recently proposed
standards and guidelines [1,2,3,4] which are at various stages of development.
These standards and guidelines apply to safety critical systems and bring together
a range of existing procedures, methods and design practices which, in combination,
are untried. These methods and guidelines include the application of safety analysis
techniques at the system level and the use of formal specification methods for the
development of software. The operations of the CBD have been the subject of a
case study within the MORSE project aimed at gaining experience of developing
software according to the aforementioned standards and guidelines.

The recommendations resulting from the safety analysis on the CBD related to the
software are to be reimplemented using the RAISE [5,6] formal specification
method.

This paper presents the experience gained in defining the CBD system and the
application of hazard identification techniques within the overall safety analysis.

2. Operations of the Clinical Biochemistry Department

The CBD provides clinical and laboratory services for a wide spectrum of on-site
and off-site users. A block diagram of the functions performed within the department
is presented in Figure 1. Patient samples are received in one of two reception areas
where they are given a unique reference number before being prepared for analysis.
Details of the tests required and the patient demographics are entered into one of
three terminals which are connected to a file server running the database and
network software programs.

The file server is connected to a number of work-stations, which are bilaterally
interfaced to computers which in turn control large capacity (7000 tests/hr)
analyzers. Patient details and test request information are transferred electronically
from the work-stations to the analyzers, which perform the analyses required along
with quality control checks before returning validated results via the work-stations
to the file server hard disk. Further quality and validity checks are performed
before test results are printed on hard copy for dispatch to clinical staff. The data
are finally archived.

3.  Safety  Analysis  within  the  Morse  Project 

An improvement in the safety of the overall laboratory operation was an important
objective of the MORSE project. To this end, it was necessary to look at the
laboratory as a complete system made up of hardware, software and manual
operations.

Safety analysis at the system level can typically be carried out in the following
stages:

system definition,
hazard identification,
hazard analysis,
risk analysis and assessment.

The exact requirements and the degree of detail to which the analysis is conducted
are likely to be affected by factors such as the safety criticality of the system being
considered, the financial and human resources available, the stage of development
and the timescale of the project.

The need for improvements to the system design and operation is usually identified
during all stages of the safety analysis.

The experience gained in the system definition and hazard identification stages of the
project is discussed below.

4. System Definition

The first stage of the safety analysis of the CBD was to produce a clear definition
of the system under consideration, its boundaries and its intended mode of operation.
Past experience of applying safety analysis techniques in other industries indicates
that the techniques can be best applied if the system is described in terms of a
number of related modules (hardware or functional blocks) on a flow diagram. The
existing documentation at the laboratory did not describe the system in this form and
therefore a new representation of the system was produced. This representation
aimed at capturing the main activities of the hardware installed, sample handling
procedures, etc., and the interfaces between them (human - computer, computer -
computer, computer - black box, where a black box refers to a computer hardware
and software package sold to the laboratory by an external supplier).

A description of the CBD system was therefore produced in the form of:

i) serum sample flow diagrams,

ii) data flow diagrams,

iii) functional block diagrams,

iv) hardware interconnections,

v) descriptive text (purpose of each module, inputs and outputs, description of
its function).

Example representations of the system are presented in Figures 1 and 2.

It was then necessary to investigate possible failures and potential consequences in
a systematic manner. This was carried out through a Failure Modes, Effects and
Criticality Analysis (FMECA), and Hazard and Operability (HAZOP) Study [7,8].
The development and use of each of these techniques is discussed below.

5. Failure Modes, Effects and Criticality Analysis (FMECA)

FMECA is a technique that can be applied to a system which can be broken down
into individual components. The components can be hardware blocks or functional
blocks. The methodology requires the assessor to have a clear understanding of the
function of each component along with all the inputs to and outputs from it.

The failure modes of each component can then be investigated in a systematic and
rigorous manner to establish the causes and the effects of the failure. This
information is recorded on a form which is designed to collect information to
establish:

how each component can fail,
what the causes of failure are,
what the effects of failure are,
how critical the effects are,
how often the failure occurs.

5.1  Experience  of  Applying  FMECA  To  the  CBD  System 

Making a judgement about the criticality of an identified failure mode has been the
subject of further development within the project in order to ensure that the team
performing the FMECA had a consistent approach. A criticality rating system was
developed to score the attributes of data in a representative way. It was thought that
"units of data" in the laboratory environment comprised integrity, flowrate and the
effects of these on the system as a whole. Account was not taken of factors external
to the laboratory such as the state of the patient and whether the doctor had made
a correct diagnosis.

The attributes of data were scored as follows.

Integrity rating (A)   Degree to which integrity is lost for an individual unit of
                       data (categories 0, 1, 2).

Flowrate rating (B)    Delay to flow of data through the component being
                       investigated caused by the failure of that component
                       (categories 0, 1, 2).

System effects (C)     Likely effect on data leaving the overall system, taking
                       into account any recovery mechanism (categories 0, 1, 2).

Failure rate (D)       Frequency with which the failure is likely to occur
                       (categories 1, 2, 3).

The criteria for scoring 0, 1 or 2 were set such that any score represented
approximately equal importance within each of the attributes A, B or C.

All the above aspects have been combined together in the following manner to
establish a total criticality rating for the failure mode identified.

Total criticality rating = (A + B + C) × D

The appropriateness of the rating system can only be assessed through the use of
engineering judgement.

The criticality rating thus established was then used to prioritise the hazards
identified and the recommendations made for improving the design and operation of
the system.

5.2  Results  From  FMECA 

Table 1 is an extract from the full FMECA that was undertaken and only shows the
details of the criticality analysis. It refers to one of two external hard disks attached
to the file server, for which three failure modes are shown. The second entry is
described as follows: if one hard disk fails, due to hardware error (failure cause),
the affected drive cannot store data (local consequences) and since the disk is
"mirrored" by an identical disk to which it is paired, information from the backup
disk is used automatically. Consequently, there is no data corruption (integrity
rating 0) and no impairment of data flow (flow rate rating 0). There are no system
consequences and this failure occurs less than once per month (failure rate 1), giving
a total rating of 0.

In example 2, sample tubes receive bar-code labels bearing critical information about
the identity of the patient from whom the sample was drawn. Four failure modes
are shown, only the first of which will be explained in some detail, as follows. If the
wrong label is attached (failure mode) due to human error (failure cause), the sample
will be wrongly identified (local consequences). The data is corrupted (integrity
rating 2) but data flow through the component is unimpaired (flow-rate rating 0).
The error may be identified at a later stage (system rating 1) and such problems
occur approximately once a week (failure rate 3), resulting in a total rating of 9.

These  examples  appear  to  reflect  engineering  intuition  about  the  relative  criticality 
of  the  failure  modes  discussed. 
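The rating scheme of Section 5.1 and the two worked examples above can be
checked with a short sketch. The Python below is illustrative only: the
FailureMode class, its field names and the prioritise helper are assumptions
introduced here and are not part of the project's FMECA form. The scores are
those quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One FMECA row; the field names are illustrative assumptions."""
    description: str
    integrity: int      # A: 0, 1 or 2
    flowrate: int       # B: 0, 1 or 2
    system_effect: int  # C: 0, 1 or 2
    failure_rate: int   # D: 1, 2 or 3

    def criticality(self) -> int:
        """Total criticality rating, (A + B + C) multiplied by D."""
        for score in (self.integrity, self.flowrate, self.system_effect):
            if score not in (0, 1, 2):
                raise ValueError("A, B and C must each be 0, 1 or 2")
        if self.failure_rate not in (1, 2, 3):
            raise ValueError("D must be 1, 2 or 3")
        total = self.integrity + self.flowrate + self.system_effect
        return total * self.failure_rate


def prioritise(modes):
    """Return failure modes ordered from most to least critical."""
    return sorted(modes, key=lambda mode: mode.criticality(), reverse=True)


# Ratings as quoted in the text for the two worked examples.
disk_failure = FailureMode("mirrored hard disk fails", 0, 0, 0, 1)
wrong_label = FailureMode("wrong bar-code label attached", 2, 0, 1, 3)

ranked = prioritise([disk_failure, wrong_label])
for mode in ranked:
    print(f"{mode.criticality():2d}  {mode.description}")
```

Under this scheme the rating ranges from 0 to 18, and the labelling error
(rating 9) is ranked above the mirrored-disk failure (rating 0), in line with
the figures given above.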

6.  Hazard  and  Operability  (HAZOP)  Study 

The HAZOP study technique was initially developed in the 1960s and 1970s within
the Mond Division of ICI for application within the process industries. The
technique consists of a critical and systematic review of the system under
consideration by a multi-disciplinary team. The review is coordinated by a chairman
who leads the investigation into sections of the system with the aim of establishing
their design intent and the ways in which that section can deviate from the defined
design intent. The deviations from design intent are investigated in a systematic
manner by application of a number of guide words such as more, less, no. These
are described more fully in [8].

6.1  Experience  of  Applying  HAZOP  to  the  CBD  System 

Past experience in applying the HAZOP technique is primarily in the process
industry where the guide words are used to investigate deviations in parameters such
as temperature, pressure and flow. Clearly these parameters are not relevant to the
operations of the CBD. It was therefore necessary to apply the basic guide words
to activities (e.g. input password) or functions (e.g. create Print file from Day file)
being performed.
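As a minimal sketch of what applying guide words to activities can look like,
the Python below pairs each activity with each guide word to produce a list of
prompts for the study team. It uses only the three guide words and the two
activities quoted in the text; the function name and the prompt wording are
assumptions made for illustration, and a real study would use the fuller guide
word set described in [8].

```python
from itertools import product

# Guide words and activities quoted in the text; a real study would use more.
GUIDE_WORDS = ("no", "more", "less")
ACTIVITIES = ("input password", "create Print file from Day file")


def deviation_prompts(activities, guide_words=GUIDE_WORDS):
    """Pair every activity with every guide word to prompt the HAZOP team."""
    return [
        f"{word.upper()}: {activity}"
        for activity, word in product(activities, guide_words)
    ]


prompts = deviation_prompts(ACTIVITIES)
for prompt in prompts:
    print(prompt)
```

Each prompt is only a starting point for discussion: the team still has to
decide whether the deviation is credible and what its causes and consequences
would be.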

6.2  Results  From  HAZOP 

Table 2 is an extract from the HAZOP study and presents an example of a hazard
identified within the electronic data transfer operations. This situation refers to the
transfer of test requests from the file server (Figure 2) to one of the analyzer
workstations. A program, called from DE1, copies records relevant to the analyzer
to a print file. If the hard copy print-out is considered satisfactory by the analyzer
operator, a further transactional file is created from which the records are
downloaded to the appropriate analyzer workstation. Once the records have been
downloaded, all records pertaining to this analyzer are flagged in order to prevent
downloading of duplicate records.

In the example presented in Table 2 the activity being considered is the flagging of
fields within the dayfile after the work has been accepted by the operator. The
purpose of this activity is to prevent the same work requests being sent forward with
future batch transfers.

A scenario has been identified where more records are flagged than should be when
the operator accepts the printout of work. The cause of this is additional work for
the analyzer in question being added to the dayfile from the other data entry terminal
(DE2) after the request for the printout has been initiated. This will mean that some
samples will remain untested.

The  software  to  carry  out  the  above  operation  can  be  easily  modified  to  avoid  the 
loss  of  work  requests  caused  by  this  scenario. 
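The scenario above can be reproduced in a few lines. The sketch below is not
the LIMS software: the Request record, the analyzer name "A1" and both flagging
functions are hypothetical, and the second function shows only one possible
modification (flagging exactly the records captured in the print file rather
than every record for the analyzer).

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    analyzer: str
    flagged: bool = False  # True once treated as downloaded


def make_print_file(dayfile, analyzer):
    """Snapshot the ids of unflagged requests for one analyzer."""
    return [r.request_id for r in dayfile
            if r.analyzer == analyzer and not r.flagged]


def flag_all_for_analyzer(dayfile, analyzer):
    """Hazardous variant: flags every record for the analyzer on acceptance."""
    for r in dayfile:
        if r.analyzer == analyzer:
            r.flagged = True


def flag_printed_only(dayfile, printed_ids):
    """Possible fix: flag only the records the operator actually reviewed."""
    printed = set(printed_ids)
    for r in dayfile:
        if r.request_id in printed:
            r.flagged = True


def untested_after(flagging):
    """Return ids that end up flagged although they were never printed."""
    dayfile = [Request(1, "A1"), Request(2, "A1")]
    printed = make_print_file(dayfile, "A1")   # operator requests the printout
    dayfile.append(Request(3, "A1"))           # late entry from a second terminal
    flagging(dayfile, printed)                 # operator accepts the printout
    return [r.request_id for r in dayfile
            if r.flagged and r.request_id not in printed]


lost_with_hazard = untested_after(
    lambda dayfile, printed: flag_all_for_analyzer(dayfile, "A1"))
lost_with_fix = untested_after(flag_printed_only)
print(lost_with_hazard, lost_with_fix)
```

In the hazardous variant the late request is flagged without ever being sent to
the analyzer; with the modified flagging it stays unflagged and would be picked
up by the next batch transfer.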

Example 2 is an extract taken from the HAZOP on the LIMS hardware which is
shown in Figure 3. All node terminals are connected by cables to a junction box
(multi-function access unit) which is in turn connected to the file server computer.
The file server is a stand alone PC with external hard disks which are mirrored by
Novell networking software.

In this example the deviations from design intent of the junction box have been
considered. A scenario has been identified under the guide word "no" where the
failure of data flow through this section of the system would result in the whole
network becoming inoperable. Action has been recommended to investigate
contingency measures in place to recover from such a failure.
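The "no data flow" finding amounts to identifying a single point of failure in
the network. A minimal sketch follows, assuming a star topology inferred from
the description above; the node names and the number of terminals are
illustrative and are not taken from Figure 3.

```python
from collections import deque

# Assumed topology: every terminal reaches the file server via the junction box.
LINKS = [
    ("terminal_1", "junction_box"),
    ("terminal_2", "junction_box"),
    ("terminal_3", "junction_box"),
    ("junction_box", "file_server"),
]
TERMINALS = ["terminal_1", "terminal_2", "terminal_3"]


def reachable(links, start, removed=None):
    """Nodes reachable from start when the `removed` node has failed."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(links, server, terminals):
    """Intermediate nodes whose loss cuts at least one terminal off the server."""
    nodes = {node for link in links for node in link}
    intermediates = nodes - {server} - set(terminals)
    return sorted(
        node for node in intermediates
        if not set(terminals) <= reachable(links, server, removed=node)
    )


critical = single_points_of_failure(LINKS, "file_server", TERMINALS)
print(critical)
```

With this assumed topology the junction box is the only intermediate node, and
its loss isolates every terminal, which matches the scenario recorded in the
HAZOP.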

7.  Discussion 

It was found that the degree of scrutiny of the system possible as part of a hazard
identification exercise depended on the amount of detail included in the system
definition. A number of revisions of the system definition were necessary to enable
assessment to an adequate level of detail so that useful results could be obtained
from the study.

The importance of system definition is stressed particularly since the application of
safety analysis in other such laboratories will require a considerable amount of effort
to define the system in a form that will facilitate such analysis.


The application of FMECA and HAZOP to the operations of the CBD was generally
regarded to be a successful exercise. Each of the techniques has its advantages and
disadvantages which are discussed further below.


The main advantage of the FMECA as compared with the HAZOP was that it
enabled a criticality rating to be applied to the hazards identified and hence eased
prioritisation of the recommendations. In the case of the CBD the analysis was
carried out by a number of people working on their own. It was found that the
progress of the analysis and degree of scrutiny depended significantly on each
person's understanding of the system. It may be possible to improve the progress
rate and the degree of detail by employing small teams of personnel according to
their knowledge of the components being investigated.

The HAZOP study was performed by a team of people with varied knowledge about
safety analysis techniques and the CBD system. In this case it was found that the
written system definition could be considerably enhanced by the knowledge of the
team members during the study. The HAZOP approach was found to allow much
more free thought about potential failures and hence was better at picking up failures
resulting from combinations of events.

The role of the chairman in stimulating thought within the HAZOP team was seen
as critical in the application of the technique. The lack of previous experience of
applying HAZOP in a clinical laboratory environment resulted in a number of
teething problems to begin with. These were primarily caused by the use of
inappropriate guide words.

The approach of using the basic set of guide words applied to activities and functions
appeared to work well. This approach can be recommended as a good starting point
when applying HAZOP in an industry where its use is untried.

8.  Where  Next 

The application of the safety analysis techniques has been very successful on the
CBD. It is however recognised that a considerable amount of effort went into the
preparation stages where the system was defined in a form that would easily
facilitate safety analysis.

With this constraint in mind it is felt that the use of safety analysis techniques can
be more readily accommodated in new developments where the system can be
defined in the required form at the outset as part of the overall documentation of the
project.

It is thought that the use of HAZOP is better suited to existing systems where
the system definition can be enhanced during the study by the knowledge of the
HAZOP team members.

Figure 2 - Computer Network - Data Flow Diagram.


The  Benefits  of  SUSI:  Safety  Analysis  of 
User  System  Interaction 


M F Chudleigh and J N Clare
Cambridge Consultants Limited
Cambridge,  England 


1  Introduction 

The use of computer based systems has many advantages, including increased
functionality, increased flexibility and, hopefully, ease of use. Because of these
advantages their use is increasing dramatically, including applications where failure
could have an adverse impact on safety. It is important to remember that most such
systems have contact with human users and are used in organisations where there are
set procedures and ways of working.

For industries such as petro-chemical, rail transport and air transport the
consequences of a failure could be catastrophic and they have realised that evaluating
and controlling the risk to humans arising from failures of such systems is vital.
The process of evaluating the possible safety failures is known as hazard analysis.
One particular method of carrying out hazard analysis is the Hazard and Operability
Study (HAZOP) [1].

In recent years there has been an increasing realisation that considering the place
of the human in a system (human factors) should be an important part of the system
design. The critical issue is that as systems become more complex the human
operators are increasingly prone to 'error'. In this case error is used to describe
behaviours that were not the designer's intent. In part, such behaviours occur
because the operators are not able to comprehend their role with the system.
However, a key part is that the operators choose to do something different to the
designer's intent because of new working procedures, or because the system did not
perform as the designers intended. However, once an incident occurs then we
normally identify human error as the root cause [2].

When we consider the design and implementation of such computer based
systems we find that at least three specialists are involved: those who design
functionality into systems (application designers); those who design the user
interface (human computer interaction (HCI) specialists); and those human factors
specialists who examine how particular tasks are carried out.

In order to build effective, safe systems it is clear that all three specialists must
be able to communicate with each other. However, the reality on most systems is
that the three areas all have their own specialist vocabulary and models of the
system, and they tend to work independently of each other. In addition, industries
which have an established record of building safety critical systems are likely to
have specialists in safety. These personnel, again, have their own particular jargon
and often may not be familiar with the techniques of developing computer based
systems. This lack of co-ordinated coverage during system development has the
potential to lead to hazardous situations.

We have developed an approach that we believe shows promise in dealing with
the above problem. The approach is called SUSI, standing for Safety analysis of
User System Interaction. SUSI comprises two parts:

• A common representation of all entities in a system so that communication
between specialists is enabled; coupled with

• a structured hazard analysis procedure which addresses features that are particular
to human-machine interactions.

The remainder of this paper describes the approach and gives examples of its use in
two different applications at different points of the system lifecycle. One is of a
new medical system design and the other is of an operational maritime system.

2  SUSI:  A  Common  Representation 

In describing our work in developing a common representation for systems we
address three main areas. First, the key realisation that a common representation is
both necessary and possible: it underpins the majority of our work in user system
interaction. Second, the applicability and limitations of the chosen approach.

2.1  A  Common  Representation  is  Necessary  and 
Possible 

In building systems, a variety of expertise is necessary. Software application
designers tend to treat the user as a separate entity from the system and leave an
external source/sink labelled HCI. The building of user interfaces is then treated as a
separate design activity given to HCI specialists with particular techniques. Human
Factors specialists often work apart from system developers and concentrate on the
activities carried out by humans (task analysis) with its own vocabulary [3]. This
analysis may well be aimed at defining manning levels or training requirements.

In developing new systems, these specialists all tend to build their own models
of the system which are not easy to correlate with the other models. In addition, the
people who want the system, those who will use it and those who might have to
judge its safety all need to have an understanding of the system. However, they all
do need to communicate with each other to ensure they share a common
understanding of the system. There needs to be a stable system view from each of
the perspectives. Gaining that understanding is not easy with a plurality of system
models and it is difficult to imagine the consequences to the rest of the system of
changes in one representation.

An analysis of the features that need to be described in understanding human
activities [4] and those used to describe a software system [5] shows a commonality
of entities:

    human task                 software process
    human information flow     software dataflow
    human interactions         software control
    human documentation        software database

With the existing widespread use of software dataflow analysis and description
tools there would thus appear to be a basis for a common representation. In
adopting a dataflow and process model for the human components of a system we
can now generate an integrated representation of the overall system. Using this we
can explore the consequences of failure in a consistent manner across the whole
system. In Figure 1, part of a medical imaging system is shown to illustrate the
convention used in this type of representation. The key components are circles to
represent a process (human or machine), a solid line is a dataflow, a dashed line is a
control flow (start, stop etc.), and two parallel lines represent a data store (we show
displays as data stores because data may be written to a screen, but there has to be an
explicit human process to read the data).


Figure 1: Part of Decomposition of Preparation and Scanning Process

2.2  The  Scope  of  the  Modelling  Approach  is  Wide,  but 
There  are  Limitations. 

The approach works well as long as the human activity to be modelled is primarily
information intensive. Thus activities such as decision taking, information transfer
and classification/sorting are good matches, such as in the medical example above.
The vast majority of human work falls into this information intensive class, but
there are limitations. The approach is not effective when the human processes have
significant motor skills or introspective reasoning components. An example where
motor skills take a prominent part might be the interaction between the driver of a
Formula 1 racing car and the active computer systems used to manage many of the
car mechanical functions.


We have found that a dataflow and process model for both the human and the
computer parts of a system allows us to generate an integrated representation of the
overall system. This model can then be used to explore various properties of the
system. A major advantage we have found is that such models are easy to
understand by non-computer specialists and that users of systems are able to
comment on and critique designs at a very detailed level.

In the next section we show how the consequences of failure may be examined
in a consistent manner across the whole system.

3  SUSI:  Hazard  Analysis  Using  Amended  HAZOP 

An approach to assuring safety of systems has been well established over many
years. However, only in more recent years has the potential for computer systems
to fail in ways that might impact safety been recognised formally in standards work
[6,7]. To facilitate the discussion two definitions from Reference [6] are given:

A hazard is a physical situation with a potential for human injury.

Risk is the combination of the frequency, or probability, and the consequences of a
specified hazardous event.

Both emerging standards and existing practice in established industries use the
same basic lifecycle approach to addressing safety, which may be summarised as
follows:

• System definition: generating a concise and complete description of the system
under review

• Hazard analysis to identify the potential for hazardous events

• Risk analysis to judge the safety risk of the system as defined

• Risk acceptability: determination of whether the risks are acceptable

• Activities, if necessary, to modify the system definition or include additional
measures in order to reduce the risk to an acceptable level

Hazard analysis is thus a key step in the process and a number of structured
procedures have been developed to provide confidence that the hazard analysis is
complete and thorough. This section introduces one of the main hazard analysis
methods, HAZOP; explains how we have extended the HAZOP to address user
system interaction; and describes some advantages and limitations of the approach.

3.1  What  is  a  HAZOP? 

The full name of HAZOP, Hazard and Operability Study, says a great deal about its
purpose. It is to ensure both that necessary features are incorporated in a design to
provide for safe operation and that features are avoided which could give undesirable
outcomes (i.e. hazards). The technique was developed by ICI in the late 1960s and
has grown to be well established in the petrochemical industries. An excellent
introduction to the technique is given in [1].

In the process industries it is usual to describe plant designs in the form of
Piping and Instrumentation Diagrams (P&IDs). The HAZOP is carried out by a
small team with the following members: team leader; team secretary; personnel who
have detailed knowledge of operation of similar systems; personnel who have
knowledge of the design intent of the system; specific technical specialists
as necessary. The team work logically through the P&IDs, examining deviations
from normal operation, asking "can the deviation happen?" and if so, "would it cause
a hazard?" (a hazard could be things such as a fire or release of toxic material). To
guide the process a series of guidewords and potential deviations are used. Thus for
liquid in a pipe, a relevant guideword is "flow" and potential deviations are "high,
low, no, reverse". For fluids another guideword is "pressure" with deviations "high,
low". In theory, each guideword/deviation should be applied to each process line
and vessel. In practice, an experienced team leader will judge the correct detail of
questioning for each area.

3.2  Modifying  HAZOP  for  User  System  Interaction 

A critical element of the HAZOP process is the choice of parameter keywords and
guidewords for deviations. We believe the guidewords for petrochemical plant have
limitations when addressing computer based systems and user system interaction.
Other work by Cambridge Consultants and their parent company Arthur D Little has
led to modifications of the HAZOP approach for computer based systems and this is
reported in [8] and [9].

For our work with user system interaction we have developed a vocabulary of
discrete entities which have associated deviations. These are shown below:


Entity         Deviation            Comments

Process        Failure              Execution fails, data is used inappropriately
               Error                Process algorithms wrong or contain flaw(s)
               Wrong Process        Wrong process selected or human short cut
               Interrupted          Process not restarted appropriately

Data Flow      Corrupted            Data changed in transit
               None                 Data does not exist
               Wrong source/sink    Data taken/sent from/to wrong place

Data Store     Corrupted            Data changed in store
               None                 Data not stored or not found

Control Flow   Corrupted            Wrong control signal
               None                 Control does not exist
               Wrong source/sink    Sent to/received from wrong place
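
As an illustration only, the vocabulary above can be held as a lookup table and used to generate the list of questions a HAZOP team would work through. The sketch below is not the tool used in the project: the function name `hazop_prompts` and the two example design elements are assumptions made for the example, and only the entity types, deviations and comments come from the table.

```python
# Vocabulary transcribed from the table above: entity type -> {deviation: comment}.
VOCABULARY = {
    "Process": {
        "Failure": "Execution fails, data is used inappropriately",
        "Error": "Process algorithms wrong or contain flaw(s)",
        "Wrong Process": "Wrong process selected or human short cut",
        "Interrupted": "Process not restarted appropriately",
    },
    "Data Flow": {
        "Corrupted": "Data changed in transit",
        "None": "Data does not exist",
        "Wrong source/sink": "Data taken/sent from/to wrong place",
    },
    "Data Store": {
        "Corrupted": "Data changed in store",
        "None": "Data not stored or not found",
    },
    "Control Flow": {
        "Corrupted": "Wrong control signal",
        "None": "Control does not exist",
        "Wrong source/sink": "Sent to/received from wrong place",
    },
}


def hazop_prompts(elements):
    """Yield one (element, entity type, deviation, comment) row per question."""
    for name, entity_type in elements:
        for deviation, comment in VOCABULARY[entity_type].items():
            yield name, entity_type, deviation, comment


# Hypothetical design elements, named for illustration only.
elements = [("Match items", "Process"), ("Patient details", "Data Flow")]
rows = list(hazop_prompts(elements))

for name, entity_type, deviation, comment in rows:
    print(f"{name} ({entity_type}) / {deviation}: {comment}")
```

Each generated row is only a starting point for discussion; as noted elsewhere in the paper, the team still has to judge which deviations are credible for each element.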

We have found that the traditional HAZOP team structure and general approach can
be used without change. However, it has been found essential to have independent
technical personnel who are experienced in system design and human factors as part
of the HAZOP team.


3.3  Advantages  and  Limitations  of  the  Modified 
HAZOP 

The main advantages of the approach are: it is done by a team; it gives a top-down
approach to the system; and it can be used both on new system designs and on
existing systems.

The team approach brings a variety of expertise and viewpoints onto a common
problem, and concentrating on hazard identification leads to productive sessions.
Also, the team, by providing a variety of viewpoints, helps to avoid excessive
investigation of non-credible hazards.

The top-down approach, examining the whole system first, allows the homing
in to key issues based on the potential hazards and is very good at assessing
system-wide implications. The approach of looking at deviations from design intent,
then their causes and consequences, encourages exploration of non-obvious
interactions both of the user/operator with the automated system and of the automated
system with its hardware environment. The use of a HAZOP provides guidance towards
the most critical areas to concentrate on in any subsequent low level investigation.

The HAZOP fits naturally at all stages of the life of a system, from concept
through to operation. In later sections we give an example of use during a medical
system conceptual design stage and another example of analysing an existing
operational maritime system.

There are, however, limitations to the HAZOP approach. We have found that
straightforward application of the deviation guidewords is not sufficient: the process
relies on the experience and intuition of the team members (especially of the
independent technical experts). Further, the choice of an experienced HAZOP leader
is key. It is the leader who controls the pace of the analysis and it takes significant
experience to guide the team discussion to the most critical areas while still
ensuring full coverage within usually tight time constraints.

4  The  Use  of  SUSI  in  a  New  System  Design:  a 
Medical Laboratory System¹

This section is divided into three parts: first a brief description of the system to
partly automate screening of cervical specimens, then an outline of the dataflow
description and finally some of the results from the HAZOP analysis.

4.1  The  Medical  Imaging  System 

The Human Genetics Unit (HGU) of the Medical Research Council in Edinburgh
have produced an experimental version of a semi-automated system for screening of
cervical smear samples to identify abnormalities which might lead to cancer. The
HGU have been working closely with the Department of Pathology of the
University of Edinburgh to carry out trials on the system and both parties were
closely involved in the work presented here.

1. The work carried out here was part of a collaborative project between The Centre for
Software Engineering Limited, Cambridge Consultants Limited and the Human Genetics
Unit of the Medical Research Council.

Within a cytology laboratory the equipment would consist of two major parts, a
robot slide preparation system (RSPS) and a slide scanning system (SSS). The
RSPS takes in sample bottles submitted by clinics and transfers a part of the
material as a monolayer sample onto slides.

The SSS is an image processing system which inspects objects on the slide and
classifies them into various categories. Where abnormal objects are identified, the
system stores digitised images for subsequent human inspection. Both the above
systems are supported by a computer based system providing overall administration
and interfaces to the laboratory main computer which stores patient records.

4.2  Development  of  the  Dataflow  Description 

The full system description is far too long to be included here: we give simplified
versions of two levels of the description to illustrate the method. The system
context identifies the complete system under consideration and its principal
interactions with the external world. Here there are two external entities: the clinics
or surgeries which collect samples and have reports returned and, within the
laboratory, the archives where reports and samples are stored. Note that the total
screening system includes the human administrators, technicians and medical
personnel who interact with the machine sub-systems.

The top level system is then decomposed whilst preserving the external data
flows. Figure 2 shows the division into four processes: sample and form validation,
preparation and scanning, review process, and laboratory computer data management.
Figure 1 (shown earlier) gives the decomposition of the preparation and scanning
process. In discussing the design of the system it became clear that process 2.2
Match Items was a key and human intensive process. It was necessary to include an
explicit process because the automated part of the system had few opportunities to
cross check for errors in the pairing of patient forms and patient sample bottles. As
the design proceeded it was also recognised that there were other quality control and
sorting processes that could also be carried out at this stage of the overall process.
As a result the Match Items process is an intensive human process which achieves
the following: cross-checking, visual quality assessment of prepared slides,
attachment of label to each slide and assignment of samples for priority processing
where there are clinical indications. To help the human carry out the process a screen
is used to display status information and a barcode reader pen is used to cross-check
barcoded items against patient details.

The dataflow design was produced by a team of a consultant (one of the
authors), a cytopathologist and a member of the design team of an experimental
automated system. The team approach ensured that the various viewpoints on the
design were captured in an effective manner and partitioned activities between human
and machine in a way that was readily understandable to all those involved. This
level of common understanding had not been achieved previously despite regular and
detailed liaison.


4.3  Application  of  the  modified  HAZOP 


The HAZOP team included representatives of both the design team and the intended
users of the system. The HAZOP leader role was taken by a member of Cambridge
Consultants who had no direct involvement in generation of the dataflow design but
was aware of the overall goals of the project. The design consultant provided the
technical expertise in system design and human factors.

The HAZOP process focused on two types of issue. One of these related to
design aspects which would enhance operability and safety; the other came from the
experience of current staff who were able to identify practical and procedural
constraints which would impact the viability or safety of the system. Below we
give some of the observations recorded during analysis of the Match Items process.


Recommendation  1.  Bar  coded  form  wrong  through  mismatch  of  names. 
Wrong  patient  gets  matched  to  slide.  Do  double  check. 

Recommendation  3.  No  report/label  because  detached.  Leads  to  delay.  Print 
report  and  label  on  same  form. 

Recommendation 4. Wrong slide pair because of incorrect location of slide.
Wrong  patient  could  be  matched.  Check  barcode  and  accession  number  match. 




5  The use of SUSI in an Existing System:
Maritime  Control 

A hazard analysis has been carried out on the navigation systems for a vessel. This
was part of a series of investigations into hazards associated with coastal traffic
operating in crowded waters. The object was to identify areas where hazards could be
expected to have significant effects. The SUSI methodology was used as the basis
for the analysis. The first stage was the production of dataflow descriptions of the
activities of watch keeping officers. This was done on a series of sea voyages
during which the dataflow models were constructed and reviewed by experienced
seaman officers. Figure 3 shows the first level decomposition of processes; most
external data flows have been omitted to aid intelligibility. As can be seen the
principal processes are navigation, ensuring that you are where you should be;
collision monitoring, to identify other mobile hazards; and conning, giving orders
for changes to speed and course. Each of these processes was decomposed to
lower levels for the actual analysis.


Figure 3  Decomposition of Ship Navigation

On completion of the data flow analysis a HAZOP team was assembled
consisting of a HAZOP leader, secretary, dataflow/human factors expert and
experienced navigating officers. At the lowest level of each branch of the
decomposition the guide words were used for every data flow, process, data store and
control flow. The output was recorded using a HAZOP database tool. The HAZOP
output was revised post team session in order to ensure consistency of responses and
to cover areas that the HAZOP team were unable to detail at this time (ie lack of
detailed knowledge of the operating procedures).
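
The step of applying the guide words "at the lowest level of each branch of the decomposition" can be sketched as a walk over the process tree that emits worksheet rows only for leaf processes and the flows and stores attached to them. This is a hedged illustration, not the HAZOP database tool mentioned above: the `Node` class, the function `leaf_worksheet`, the sub-process "Take visual fix" and all flow and store names are hypothetical. Only the three first-level processes and the deviation names come from the paper.

```python
from dataclasses import dataclass, field

# Deviations per entity type, as listed in the modified HAZOP vocabulary.
DEVIATIONS = {
    "process": ["Failure", "Error", "Wrong Process", "Interrupted"],
    "data flow": ["Corrupted", "None", "Wrong source/sink"],
    "data store": ["Corrupted", "None"],
    "control flow": ["Corrupted", "None", "Wrong source/sink"],
}


@dataclass
class Node:
    """A process in the dataflow decomposition (structure assumed for this sketch)."""
    name: str
    children: list = field(default_factory=list)  # lower-level processes
    elements: list = field(default_factory=list)  # (name, kind) flows and stores


def leaf_worksheet(node):
    """Yield (item, kind, deviation) rows for leaf processes and their elements."""
    if node.children:
        # Not a leaf: descend, the guide words are applied further down.
        for child in node.children:
            yield from leaf_worksheet(child)
        return
    for name, kind in [(node.name, "process")] + node.elements:
        for deviation in DEVIATIONS[kind]:
            yield name, kind, deviation


# First-level processes follow the paper; everything below them is invented.
ship = Node("Ship navigation", children=[
    Node("Navigation", children=[
        Node("Take visual fix",
             elements=[("Bearings", "data flow"), ("Chart", "data store")]),
    ]),
    Node("Collision monitoring", elements=[("Radar contacts", "data flow")]),
    Node("Conning", elements=[("Helm order", "control flow")]),
])

worksheet = list(leaf_worksheet(ship))
print(len(worksheet), "rows to review")
```

A real study would attach causes, consequences and recommendations to each row during the team session; the sketch only shows how the coverage list is built.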




The next stage was to develop fault trees with top events of collision,
grounding etc. The fault trees were constructed using available data on navigation
aid failures and human error rates. Human error rates were factored by stress level,
complexity and experience. The human error rates used were for single events and it
was noted that normal navigational procedures require multiple independent
sightings. If errors are suspected then new sightings are taken. For example a
visual fix requires the bearing at regular times of three fixed points. All three have
to coincide for an error free fix to be recorded. In many cases this would then be
cross checked with a radar fix or a navigational beacon.
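
To make the fault tree arithmetic concrete, the sketch below combines a factored human error rate with a navigation aid failure under standard AND/OR gates for independent events. Every number in it is invented for illustration, and the multiplicative form of the factoring, the choice of gates and the event names are assumptions of this sketch; the paper does not give its rates or tree structure.

```python
from math import prod


def and_gate(*probabilities):
    """All independent input events must occur."""
    return prod(probabilities)


def or_gate(*probabilities):
    """At least one of the independent input events occurs."""
    return 1 - prod(1 - p for p in probabilities)


def factored_rate(base, stress=1.0, complexity=1.0, experience=1.0):
    """Scale a single-event human error rate by multipliers (assumed form)."""
    return min(1.0, base * stress * complexity * experience)


# Illustrative values only, not data from the study.
visual_fix_error = factored_rate(0.01, stress=2.0, complexity=1.5, experience=0.5)
radar_fix_error = factored_rate(0.01)

# Assumption: a position error goes unnoticed only if the visual fix and the
# independent radar cross-check are both wrong.
undetected_position_error = and_gate(visual_fix_error, radar_fix_error)

nav_aid_failure = 0.001  # illustrative
grounding = or_gate(undetected_position_error, nav_aid_failure)

print(f"P(undetected position error) = {undetected_position_error:.6f}")
print(f"P(top event: grounding)      = {grounding:.6f}")
```

The AND gate is where the requirement for multiple independent sightings shows its effect: under these assumptions the cross-check reduces the single-event error rate by roughly two orders of magnitude.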

The SUSI approach was both necessary and valuable for this task as there were
no simple descriptions of the navigation tasks or use of navigational aids. There
were a series of operating procedure manuals, the training materials for ships'
navigation officers and various instructions to mariners issued by regulating bodies.
In order for the HAZOP team to operate, this mass of data had to be compiled into an
easily digestible form which could also be validated against actual operational
practice. The development of the dataflow description provided a clear and easily
understood view of these activities (one training officer has decided to use them as
navigational training material).

The overall process of hazard analysis was completed in four weeks, from
boarding the first ship to delivering the final report.

References 

1  A Guide to Hazard and Operability Studies. Chemical Industries Association
Limited, 1987.

2  Westrum R. Technologies and Society. Wadsworth Inc. 1991

3  Stammers R, Carey M, Astley J. Task Analysis. In: Wilson J, Corlett E (eds)
Evaluation of Human Work. Taylor & Francis 1990, pp 134-160.

4  Wilson J, Corlett E (eds) Evaluation of Human Work. Taylor & Francis 1990

5  Hatley D, Pirbhai I. Strategies for real-time system specification. Dorset
House, 1988.

6  Functional Safety of Electrical/Electronic/Programmable Systems. Generic
Aspects. IEC 65A (Secretariat) 123. 1991

7  Interim DefStan 00-56. Hazard Analysis and Safety Classification of the
Computer and Programmable Electronic System Elements of Defence

8  Chudleigh M, Catmur J. Safety Assessment of Computer Systems using
HAZOP and Audit Techniques. In: Frey (ed) Safety of Computer Control
Systems 1992 (Safecomp '92) pp 285-292.

9  Chudleigh  M.  Hazard  Analysis  Using  HAZOP:  A  Case  Study.  In  these 
proceedings. 


Panel  Session 


TECHNOLOGY TRANSFER BETWEEN ACADEMIA AND INDUSTRY


Moderator:  F.  Redmill 
Safety  Critical  Systems  Club,  UK 


Issues Affecting Technology Transfer and Experience with a Community Club

Felix  Redmill 

Redmill  Consultancy 

and  Co-ordinator  of  the  Safety-Critical  Systems  Club 
22 Onslow Gardens, London, N10 3JU, UK


1  INTRODUCTION 

Achieving technology transfer is a perennial problem. Yet, without it, industry's
problems can remain unsolved, and solutions to one problem are not generalised
to more or broader applications.

There  are  two  objectives  of  the  paper.  The  first  is  to  raise  a  number  of  issues 
concerning  the  transfer  of  technology  from  academia  to  industry.  The  second 
is  to  show  how  a  community  club  is  helping  to  facilitate  technology  transfer. 

2  ISSUES  IN  TECHNOLOGY  TRANSFER 

Some of the typical problems affecting the transfer of technology from academia
to industry are listed, with brief notes.

2.1  Traditionally  there  is  a  delay  between  the  development  of  a  technology 
in  academia  and  its  implementation  in  industry.  Some  say  that  this  can  be  as 
long  as  ten  years.  Many  industries  have  no  formal  contacts  in  academia  and 
they do not know how to find out what new technologies are available.
Moreover,  academia  is  not  noted  for  effective  marketing  of  its  products. 

2.2  The terms in which academics typically present their findings are those
understandable to other academics, and not to industrialists. This contributes
to the delay in the recognition and implementation by industry of new
technologies.

2.3  Even when transfer does occur, it is usually to a small sector of industry,
and the majority of those who need the technology do not receive its benefit.


2.4  Feedback  from  industry  to  academia  is  not  standardised.  It  is  seldom 
good  and  almost  always  slow.  Thus,  potentially  useful  but  flawed  or  unready 
technologies  are  rarely  corrected. 

2.5  Communication between the users of a technology, particularly across
industrial sectors, is at best poor and often non-existent. Thus, the lessons
learned by one company, or industry, are not communicated, and others must
undergo the learning curve for themselves. When the technology is flawed,
finding the problems and suffering their consequences must also be repeated.

2.6  When there is effective marketing of new technology, for example by
consultants, it is not uncommon for inappropriate technologies to be
transferred. Consultants often transfer their pet technologies rather than those most
suitable to the problem in hand.

2.7  Typically,  technologies  are  developed  for  specific  applications  in  specific 
industry  sectors.  Often,  however,  they  are  proposed  as  being  effective  over  a 
much  wider  range  of  applications.  When  they  are  found  to  be  ineffective,  those 
who  have  used  them  may  be  left  suspicious  of  academic  innovations. 

2.8  Much technological development in academia is directed towards
obtaining degrees and preparing publications rather than to getting the
technologies right. The resulting technologies may then not be properly refined. With
luck, such technologies do not get transferred. When they are transferred, they
are often not suitable for use, and industry wastes time discovering this.

2.9  In  the  normal  course  of  events,  a  great  deal  of  technology  transfer  is  of 
proven  technologies  to  new  environments  rather  than  of  the  introduction  of 
new  technologies. 

3  THE  SAFETY-CRITICAL  SYSTEMS  DOMAIN 

Safety-critical  computer  systems  are  a  relatively  new  phenomenon,  but  the 
domain  is  expanding  rapidly.  Moreover,  it  is  not  growing  out  of  nothing,  but 
is  the  coming  together  of  a  number  of  other  fields.  The  two  main  components 
are  safety  engineering  which,  though  not  new,  has  traditionally  been  restricted 
to  a  small  number  of  industry  sectors,  and  software  and  systems  engineering, 
which is itself a new field. In addition, safety-critical systems are expanding into
almost  every  sector  of  industry,  and  demanding  input  from  other  specialisms, 
such  as  human  factors  and  quality  management.  This  leads  to  the  following 
observations  on  the  need  for  technology  transfer  in  the  safety-critical  systems 
domain. 


3.1  There is an urgent requirement for the development and transfer of new
technologies to meet the particular safety demands of safety-critical systems.

3.2  Software and systems engineers are typically not familiar with safety
engineering, and there is a need for the transfer to them of existing safety
technologies.

3.3  Similarly, there is a need for safety engineers to become familiar with
computer systems and software technologies and practices.

3.4  The  knowledge  and  technologies  of  the  human  factors  domain,  on  such 
issues  as  human  dependability  and  human-computer  interaction,  urgently 
need  to  be  assimilated  in  the  safety-critical  systems  community. 

3.5  The need is urgent for an improved awareness, in all sectors of industry,
of the application of safety-critical systems.

3.6  There is the need for the transfer of existing technologies and practices
between industries.

3.7  If the development and transfer of technologies is to keep pace with
expansion in the domain, there needs to be focused research, and easily
accessible communication links between industry and academia. Improved
communication links would also allow transferred technologies to be
improved and made fit for purpose.

3.8  Mechanisms are needed for communication across industry sectors, so
that experiences of new technologies can rapidly be communicated.


4  SETTING  THE  SCENE  FOR  A  COMMUNITY  CLUB 

Technology  transfer  may  take  place  in  a  number  of  ways.  For  example, 
individual companies may make contact with one or more universities or
research establishments; more and more, universities are encouraging the
setting up of 'spin-off' companies for the purpose of marketing and selling their
technologies;  professors  and  other  researchers,  in  their  roles  as  consultants,  are 
active  in  the  transfer  of  technologies,  and  they  carry  considerable  responsibility 
in  choosing  what  they  recommend  to  their  clients;  reports  and  publicity  may 
catch  the  attention  of  industry.  There  is  also  the  possibility  of  achieving 
technology  transfer  via  a  community  club. 

In 1991, in the UK, the British Computer Society (BCS) and the Institution of
Electrical Engineers (IEE) were contracted by the Department of Trade and
Industry (DTI) to set up a community club for technology and information
transfer, and for raising awareness, in the safety-critical systems domain. The
two Societies were assisted by the Centre for Software Reliability at the
University of Newcastle upon Tyne, who in turn engaged the current author to
be the club's Co-ordinator. The club was launched in May 1991 and the high
level of interest in it was demonstrated by the attendance of 255 delegates at the
inaugural meeting in July of that year. By May 1993, there were 1682 members,
of whom 130 were from outside the UK.


5  THE  CLUB'S  OBJECTIVES 

The club exists to facilitate information and technology exchange, and to
increase  awareness,  in  the  safety-critical  systems  domain.  It  is  recognised  that 
in  order  to  be  successful  in  this,  the  club  must  gain  access  not  only  to  engineers 
and  technicians  but  also  to  managers  with  decision-making  responsibilities. 

By  facilitating  communication  across  industry,  the  club's  objectives  are  to: 

•  Increase the rate of dissemination of useful technologies;

•  Prevent  the  spread  of  flawed  technologies  by  the  rapid  communication 
of  experience; 

•  Improve  the  industrial  testing  of  new  technologies; 

•  Bring industrialists together to plan feedback to academia and to
coordinate the sponsorship of research.

By  facilitating  communication  between  academia  and  industry,  the  club's 
objectives  are  to: 

•  Improve  the  choice  and  application  of  technology; 

•  Accelerate the feedback to academia of experience in the use of
technologies;

•  Improve safety-critical computer systems which are supplied to industry;

•  Facilitate the targeting of research;

•  Accelerate the correction and improvement of flawed but useful
technologies.

It is also the club's objective to provide a platform for reporting on research into
new technologies and experience in their use in industry.


6  SUCCESS  IN  MEETING  THE  OBJECTIVES 

6.1  Newsletter 


A newsletter, of at least 10 pages, is published three times per year and
distributed to all members. Typical contents are:

•  Feature articles on safety-critical systems matters;

•  A calendar of events on safety-critical systems;

•  A calendar of events on related issues;

•  Calls for papers for future conferences;

•  Reports on new products;

•  Reports on government studies or initiatives affecting safety-critical
systems;

•  Comments  by  members  on  safety-critical  issues. 

6.2  Seminars 

By June 1993, the club had held nine seminars. Of these, seven were one-day and
two  were  two-day  events.  The  topics  covered  were: 

•  Inaugural  meeting  and  introduction  to  safety-critical  systems; 

•  Requirements  for  safety-critical  systems; 

•  Education  and  training  for  safety-critical  systems  professionals; 

•  Safety-critical  software  and  technology  in  the  medical  sector; 

•  Standards  for  safety-critical  software; 

•  Human  factors  in  safety-critical  systems; 

•  Design for safety and reliability;

•  Safety-critical  systems  in  the  nuclear  sector; 

•  The  safety  case. 

In  the  interest  of  bringing  industrialists  of  all  sectors  together,  the  majority 
of  the  seminars  are  on  topics  which  are  of  broad  application.  However,  it  has 
also  been  the  club's  policy  to  hold  one  sector-specific  seminar  each  year.  In  1992 
and  1993,  these  were  devoted  to  the  medical  and  nuclear  sectors  respectively. 
In  each  of  these  cases,  and  particularly  in  the  latter,  many  other  industrialists 
and  academics  attended  in  order  to  learn  the  lessons  of  that  industry.  Thus,  the 
objective of cross-fertilisation is being achieved.

The speakers at the first five seminars were invited to prepare chapters,
based on their presentations, for a book. Twenty-two speakers responded, and
the  resulting  book  [1]  was  published  by  Chapman  and  Hall  in  1993. 

At  the  nine  seminars  held  so  far,  the  total  attendance  has  been  1283.  At  each 
event,  the  delegates  have  been  asked  to  complete  questionnaires  on  the  quality 
and value of the seminar and, without exception, the feedback has been
positive.

By the end of 1993, two further one-day seminars will have been held, on
'Testing and validation of safety-critical systems' and 'Measurement of safety
and reliability'.

6.3  Annual Symposium

The  club  has  initiated  an  annual  three-day  symposium,  the  Safety-critical 
Systems  Symposium  (SSS),  to  be  held  in  February  of  each  year.  The  first,  SSS  '93, 
held in Bristol, attracted 190 delegates. Thus, the total attendance at the first ten
club  events  was  1473  -  an  average  attendance  of  147.3  delegates. 

The  19  papers  presented  at  the  symposium  covered  a  broad  spectrum,  many 
reporting  on  research  projects  involving  collaboration  between  industry  and 
academia.  One  of  the  goals  of  the  club  is  to  provide  a  forum  for  the  reporting  of 
the  results  of  these  projects,  and  in  the  years  to  come  the  Safety-critical  Systems 
Symposium will provide this platform. The proceedings of the symposium [2]
were published by Springer-Verlag.

SSS  '94  will  be  held  in  Birmingham  in  February  1994. 

6.4  Ad  Hoc  Activities 

The principle of the club's existence is cooperation rather than competition.
Thus, the club has participated and assisted in activities not mentioned among
its principal objectives. In this respect, it has co-sponsored events, assisted in the
organisation of workshops, given advice on safety-critical issues, brought
together potential participants of collaborative projects, and given publicity to
safety-critical matters. The club continues to offer support whenever
appropriate.


7  CONCLUSIONS 

This short paper has listed a number of issues in the transfer of technology
between academia and industry. It has also reported on the experience of how
a community club can contribute to technology transfer, effectively and over a
broad spectrum.

In two years of operation, the Safety-Critical Systems Club in the UK has
attracted a large membership, staged ten successful events, published two
books, and further facilitated technology transfer by the publication of a regular
newsletter, the co-sponsorship of events, and the provision of advice. It
provides a model which could be used in other parts of the world.


Subsidiaries and start-up
Spin-off  companies 
of  Inria 


Panelist: Jean-Pierre Banâtre
Inria-Rennes/Irisa

Campus de Beaulieu - 35042 Rennes cedex, France

This document is a summary of Inria's policy for the creation of private companies
as a privileged means for the transfer of basic research results.

1  Inria 

The National Institute for Research in Computer Science and Control Automation
(Inria) is a French scientific public institute under the responsibility of the ministry
of research and technology and the ministry of industry. With its headquarters
located at Rocquencourt near Versailles, Inria has five research centers located at
Rocquencourt, Sophia-Antipolis, Rennes, Grenoble and Nancy respectively.

Inria brings together 1000 scientists, including 250 permanent research staff and
more than 300 PhD students. Its budget in 1991 is of the order of 450 MF (75
M$). Inria's activities in information processing and control theory encompass basic
and applied research, design of experimental systems, international scientific
exchange, cooperative international programs and technology transfer. The latter is
probably one of the most important ones in the context of information technology
where changes happen rapidly.

2  Transfer  of  research  is  a  must 

Inria has, over the last 10 years, encouraged the dissemination of its research results
and communicated them in the industrial and scientific communities. The transfer is
organized through information seminars, research contracts, reception of researchers
and engineers from industry in Inria's research teams and detachment of Inria's staff
to industrial research centers.

This policy has been systematically accompanied by a diffusion, as large as
possible, of results through publications and dissemination of software to partners
(universities, public and industrial research centers, industrial departments of
research, etc.). This diffusion allows for the evaluation of prototypes and the
feedback by the users.

3  The  creation  of  "high-tech"  companies  : 
a  solution  for  transfer 

Large companies are not always able to assume a direct transfer of research
prototypes, especially for basic tools; this situation may be due to the rigidity of
large structures, the difficulties to manage the competition with internal teams, the
difficulties to adapt new technologies to the internal strategy, the cost of knowledge
transfer, but mainly, the transfer of products without the transfer of men.

Inria has been encouraging the creation of high tech companies by its own
researchers; these companies remain close to research for the main reason that their
staff is composed almost exclusively of former researchers. Their business is
moreover mainly product oriented.

In this context, fourteen companies have been created up to now in the environment
of Inria. They are either subsidiaries or not, but their main goals are always the
transfer, exploitation and commercialization of know-how and prototypes
originating from Inria.

4  Subsidiary  or  spin-off  ? 

Inria's legal status (Public Institute for Science and Technology) allows the
institute to share in the capital of private companies. In this context, Inria created
three subsidiaries (Simulog in 1984, Ilog in 1987, O2 Technology in 1990) where
it controls the minority of the capital, and Gipsi SA (coming from the groupment
Gipsi SM 90) controlling a minority of the capital. The creation of such
subsidiaries only occurs under the following conditions:



• a large investment in research,

• a strategic technology,

• a new but attractive market,

• a very competitive international context,

• the lack of motivated industrial partners.

During this time, researchers, engineers or former PhD students of Inria have
created fifteen companies. They represent one of the most dynamic vectors for
transfer. A minority of them industrialize and exploit products originating from
Inria, under license. These licenses are negotiated on a non-exclusive basis and
royalties are determined according to the rules as applied to usual partners.

Of the fifty licenses for prototypes active in 1990, nearly half involve
spin-off companies and subsidiaries. More than 70% of the total amount of
royalties comes from these companies, and the progression is accelerating (only
25% in 1989).

Conversely, these companies help Inria to understand strategic information
regarding market needs, helping Inria to take strategic decisions concerning research
orientation.


Human  Medium  in  Technology  Transfer 


W.  Cellary 

Franco-Polish  School  of  New  Information  and  Communication  Technologies 

Poznan,  Poland 


To analyze problems of technology transfer between academia and industry, it is
first important to understand the expectations of both sides. In my opinion, academia
expects money and problems to be solved, while industry expects problem solutions and
people trained in research. A question is: what is the role of these people trained in
research in technology transfer from academia to industry, and why are they expected
by industry? One may believe that the medium of technology transfer is paper, i.e.
written documents describing problem solutions. I claim that to transfer a new
technology a human medium is required. In a new technology (I emphasize the word
new, meaning here revolutionary) developed in academia, i.e. mostly at the
theoretical level, the main concept is established, but a lot of problems remain
unsolved. I mean here research problems, not just implementation ones. If the
concept is revolutionarily new, it is difficult to transfer it in its integrity to a
foreign team. Still more difficult is to transfer an idea of how to solve the related
problems in such a way that the main concept is not violated or distorted. The most
efficient way to transfer technology is to use humans, i.e. the researchers who
developed it, as a medium. This, however, leaves a painful hole in academia, and thus
is successful only if academic research teams are relatively big and can be split.

Let us now analyze the career of a young researcher of above-average talent in
computer engineering, who prepared his PhD in collaboration with a team developing a
new technology. Assume that he is 28 years old when he gets his PhD, and that he is
hired by industry to transfer technology and develop a product based on this new
technology. Nowadays, the mean lifetime of a software product is around twelve
years. If we add three years to develop the product, we get fifteen years. Assume that
half of this time is creative, i.e., some scientific research related to the product
needs to be performed. We may call it an offensive phase. The second half, called
the defensive phase, is devoted mostly to maintenance and some development:
moving to various platforms, integration with various software products available on
the market, etc. In this phase, innovations are restricted, because of compatibility
with previous product versions. Our researcher is 35 years old at the end of the
offensive phase of the product. At this age he has a good potential of creativity and,
moreover, he has good experience in development of industrial products, cooperative
work in a team, etc. There is no reason to keep him to maintain the product during
its defensive phase, which is not creative. He is at a good point of his professional
career to assume responsibility for a new advanced project. There are, however, two
dangers: first, that he will continue the old project in a new frame; second, that he
will create a new project, but his role will be reduced to the managerial aspects only.
To avoid these dangers he has to be trained in the new technologies developed in
academia during the time when he was occupied with his first product. Seven years
elapsed since his PhD is a very long period in so active a scientific domain as computer
engineering. During this period some new concepts and new research directions
have appeared that are more suitable and more promising for new products that have to
defend themselves against other products even ten years later. A challenge for
academia is to integrate and efficiently train such people, as a medium of advanced
technology transfer. They need some special organizational solutions, because they
cannot be simply mixed with postgraduate students. I think they need around two
years of training: one year to study a new domain, and one year to start to produce
original results in this domain. A good solution would be a kind of sabbatical. Its
advantages are the following. For industry, it is an investment in future products
using the most advanced technologies. For academia, it is the growth of research
potential, financed by industry, and feedback from practice and applications. In the
Franco-Polish School of New Information and Communication Technologies we are
encouraging industry to apply this solution.


Technology  Transfer  -  from  Purpose  to  Practice 


Bob  Malcolm 
Malcolm  Associates  Ltd. 
Savoy  Hill,  London,  UK 


Abstract 

It is submitted that technology transfer between academia and industry is not a
matter  of  academia  telling  industry  how  to  do  things  better,  but  of  industry  better 
understanding  its  own  needs  and  being  better  able  to  evaluate  academic  work. 


1  Introduction  -  the  purpose 

One of the recommendations of the IEE-BCS report on “Software in safety-related
systems”  [1]  was  that  there  should  be  a  programme  of  awareness  and  dissemination 
of  latest  developments  and  of  best  practice,  in  parallel  with  a  research  programme. 
The  research  programme  is  now  established,  and  there  are  now  35  projects,  with 
over  130  industrial  organizations,  and  over  30  academic  institutions  participating 
[2]. The present paper sets out the considerations involved in establishing the
related technology transfer programme.

2  Policy 

The  first  step  in  any  government-led  initiative  must  be  to  identify  the  policy  which 
both  guides  and  constrains  any  action.  In  the  present  case  the  relevant  policy  was 
set  out  in  a  UK  government  policy  paper  concerned  with  innovation  (a  ‘White 
Paper’) from 1988 [3].

Frequently,  technology  transfer  is  discussed  in  the  context  of  ‘encouraging  adoption 
of  best  practice’.  This  sounds  very  reasonable  at  first  hearing,  but,  if  interpreted  too 
simplistically,  it  implies  that  someone  knows  what  best  practice  is;  that  someone 
will  pick  technological  winners  which  they  will  inflict  upon  everyone  else.  Would 
you  trust  any  individual,  or  worse,  a  committee,  to  do  this  on  your  behalf? 

An  alternative  approach  was  presented  in  the  White  Paper.  This  is  that  the  role 
for  any  government-inspired  technology  transfer  activity  should  be  to  make  it  easier 
for  suppliers  to  select  the  most  appropriate  technology  for  their  business.  (Note  that 
in  this  case  ‘suppliers’  are  users  of  technology.) 

This  policy  is  motivated  by  an  economic  argument.  A  government-led  technology 
transfer  programme  is,  in  effect,  intervention  in  the  free  market.  The  economic  case 
for  government  support  of  any  such  action  must  be  that  there  is  some  kind  of  failure 
of  the  free  market  to  deliver  an  optimum  product-price  combination  to  end-users. 
However,  a  ‘perfect’  market  requires  ‘perfect  information’.  It  is  not  too  difficult  to 
make  a  case  here  that  those  operating  in  this  business  do  not  have  ‘perfect  information’ 
about  the  latest  technological  developments. 




3  Theory 

A  study  was  performed  of  the  principles  of  technology  transfer,  so  as  to  inform 
those in government responsible for establishing any initiative [4]. The study identified
some  of  the  parameters  to  be  considered  in  such  an  initiative,  which  were  then  used 
to  guide  the  selection  of  technology  transfer  mechanisms. 

3.1  The  role  of  technology  transfer  in  innovation 

Technology  transfer  is  implicitly  part  of  technologically  fuelled  industrial  innovation. 
Note  -  innovation  -  the  putting  to  work  of  new  developments,  which  is  distinguished 
from  their  invention. 

Taken  literally,  technology  transfer  is  the  actual  transfer  of  the  technology  -  the 
adoption  by  organizations  of  technology  developed  elsewhere,  whether  in  academia 
or  industry.  It  can  be  achieved  either  by  organic  technological  innovation  within  the 
firm,  or,  for  instance,  by  corporate  acquisition  of  a  company  with  new  technological 
capability. 

The  innovation  process  varies  from  sector  to  sector,  at  least  in  detail,  because  of 
their  different  structures.  There  are  also  differences  in  the  way  in  which  technology 
is  transferred  across  different  groups  of  sectors  -  again  due  to  different  industrial 
structures  [5].  However,  in  general,  innovation  is  a  consequence  of  diffusion  of 
knowledge. Moreover, von Hippel proposes that a “significant mechanism”
contributing to this diffusion is “informal know-how trading”, which is “essentially
a pattern of informal co-operative R&D. It involves routine and informal trading of
proprietary information between engineers working at different firms - sometimes
direct rivals.” [6]

Key  players  in  this  diffusion  of  knowledge  are  the  industrial  ‘gate-keepers’  [7]. 
These  are  the  personnel  in  companies  who  have  the  external  contacts  with  emerging 
and  prospective  technologies,  and  who  are  respected  by  the  decision-makers  and 
business  managers,  inside  the  company,  who  are  able  to  use  such  technology 
commercially.  Such  individuals  will  be  familiar  with  trends  in  their  application 
sectors  which  are  likely  to  influence  the  development  of  technology,  either  through 
market  pressure  for  technological  development  or  through  constraints  on  the  nature 
of  such  development  coming  from  emerging  regulatory  changes  or,  again,  the  market. 
‘Gatekeepers’  are  able  to  filter  ideas  and  information  and  promulgate  them 
appropriately.  They  are  important  nodes  in  a  communication  network  which  is  itself 
a  vital  part  of  technology  transfer. 

Technology  transfer  communication  should  be  seen  as  two-way  traffic.  This  is 
important  because  it  appears  that  in  many  sectors  invention  as  well  as  take-up 
innovation  often  originates  with  the  industrial  technology  users  [6].  Indeed,  recent 
research puts it more strongly: “Successful technological change is more likely to be
market-led than science-led.” [8]
3.2 Leaders & followers, small & large

Companies which embrace innovation - the ‘leaders’ - see it as a necessary investment,
while aware of the risks. And it is an ongoing investment, rather than a once-off
transfer of instances of presently available technology. “Companies who get there
first do not benefit from a lasting loyalty from customers. Successful innovation
means more than initiating a product or technique and bringing it to fruition. It
means constant improvement, ....” [8]

However,  many  companies  -  the  ‘followers’  -  are  not  geared  up  to  innovation. 
They  do  not  positively  adopt  this  as  their  route  to  future  wealth  [9].  The  view  they 
take  is  that  it  is  sufficient,  and  even  best  for  the  shareholders,  to  come  second  -  to  let 
someone  else  take  the  risks  [10]. 

It  is  interesting  that  the  present  political  climate  is  one  in  which  emphasis  is 
placed  on  encouraging  the  economic  growth  of  smaller  companies.  Perhaps 
surprisingly,  the  evidence  indicates  that  small  companies  are  not  noticeably  more 
innovative,  on  balance  [11].  It  is  the  large  firms  who  are  more  likely  to  have  their 
own  R&D  activity,  who  are  both  easier  to  target  and  best  equipped  to  take-up  new 
technology,  and  who  can  make  a  bigger  dent  in  the  overall  industry  culture.  As  they 
are  also  a  source  of  innovations  ripe  for  diffusion  elsewhere,  they  are  an  important 
contributor  to  the  technology  transfer  communication  network.  So,  it  would  not 
make  sense  to  exclude  such  firms  from  any  initiative. 

3.3 Force or facilitate?

Any  technology  transfer  initiative  must  be  distinguished  from  the  actual  transfer 
itself.  Now,  the  role  of  a  technology  transfer  initiative  could  have  as  its  purpose 
either  the  transfer  itself,  or  the  facilitation  of  the  transfer. 

At  its  most  extreme,  the  former  takes  us  back  to  the  ‘technology-push’  approach, 
in  which  somebody  decides  what  should  be  transferred.  What  must  not  happen  is 
that  some  central  decision-making  committee  should  pick  winners  from  the 
technology-push  supply  side  (even  under  the  guise  of  ‘consultation’)  and  make 
everyone  use  their  preferred  technology.  After  all,  who  would  believe  that  a  central 
committee  could  be  both  sufficiently  competent  and  unbiassed  by  either  lobbying  or 
by  having  fixed  ideas?  And,  anyway,  however  well  it  is  done,  any  such  prescription 
will  ultimately  stifle  innovation. 

In  this  context,  we  should  address  the  potential  role  of  regulation  as  a  means  of 
accelerating  take-up  of  new  technology.  This  is  certainly  possible,  but  it  must  be 
handled  very  carefully  and  intelligently.  The  intention,  once  again,  is  not  to  enforce 
the  take-up  of  any  particular  technology.  Such  prescription  is  anti-competitive, 
restricts  trade,  and  stultifies  competition,  innovation,  and  technological  development 
-  just  the  opposite  effect,  in  the  long  term,  to  the  effect  which  is  sought. 

Where regulation is necessary, Rothwell and Zegveld refer to “the desirability,
where possible, of formulating regulations that allow maximum freedom in developing
technology for compliance.” [7]. It is not too difficult to avoid constraining compliance
to particular technologies; it is at least as important, and usually much more difficult,
to frame regulation which does not presume certain classes of technological solution.


On  the  other  hand,  if  it  can  be  done  properly,  then  tough,  technology-free, 
targets  can  stimulate  the  development  of  a  range  of  innovative  solutions.  It  is 
important that the targets are indeed tough, else the inclination is to adopt ‘best
practice’ targets, based on existing technology, which tend to be
lowest-common-denominator targets, thereby, once more, inhibiting competition and
innovation rather than accelerating it.

Returning  to  the  alternative  of  facilitation,  any  facilitation  of  technology  transfer 
might  either  directly  support  the  explicit  adoption  of  new  technology  (while  not 
prescribing  what  that  should  be,  of  course)  or  it  can  more  indirectly  attempt  to 
overcome  -  or  undermine  -  some  of  the  barriers  to  innovation. 

The  two  major  barriers  are,  it  seems,  a  lack  of  awareness  of  new  developments 
and  practice  elsewhere  -  despite  the  availability  of  information  for  those  that  positively 
look  for  it  -  and  an  inability  to  assimilate  change,  primarily  because  of  shortcomings 
in  “the  strategic  ability  of  management  to  integrate  externally  acquired  technology 
into  an  overall  business  plan”  [9].  It  is  fairly  clear  that  a  technology  transfer 
initiative  could  address  the  first  of  these.  It  is  less  clear  that  problems  with  the 
strategic  ability  of  managements  lie  within  the  scope  of  technology  transfer.  Indeed, 
in  the  UK  there  is  now  a  much  more  broadly-based  attempt  to  inspire  an  innovatory 
culture  ([12],  for  example). 

However, technology transfer actions of the right kind can help. It appears that a
‘perceived performance gap’, compared with fairly close competitors, is a major
motivator for innovation. “The technological strategy [of business units within firms]
is to achieve the full potential of the product or process [around which they are
organised] in a way that at least matches the performance of rivals” [8]. So information
about what is happening elsewhere in industry is at least as important as information
about technology. A technology transfer initiative can address both of these.

3.4  The  players  and  their  parts 

In  essence,  the  very  much  simplified  technology  transfer  model  in  the 
Buxton-Malcolm  paper  [4]  comprises: 

•  awareness  -  coupled  with  ‘interest’ 

•  gatekeeping 

•  in-depth  economic  evaluation 

•  decision 

•  acquisition  of  technology  and  capability  in  its  use 

Again  simplifying  very  much,  we  identify  a  number  of  types  of  individual  in  an 
organization.  They  play  different  roles  in  each  of  these  stages,  and  require  different 
types  of  information  in  order  to  perform  properly: 

• senior-managers - the ‘decision-makers’ as they are often called, except ...

•  middle  management  -  who  might  well  put  this  year’s  bottom  line  before 
high-level  highfalutin’  ideas  about  change,  and  therefore,  in  reality,  make  the 
decisions  by  default 

•  gatekeepers 

•  engineers  -  ‘the  workers’ 

•  researchers 
Note the important requirement for ‘interest’ in any awareness activity. This is
different  from  the  salesmen’s  ‘attention’  (which  is  necessary  but  not  sufficient).  A 
‘perceived  performance  gap’  referred  to  previously  is  one  way  to  grab  both  the 
attention  and  interest  of  decision-makers  and  managers. 

After awareness, assuming interest is aroused, we need to make sure that
gatekeepers have access - and have preferably already had it - to the sort of technical
and economic feasibility information which they can believe, so that when asked, they
do not block the introduction of technology simply because of a lack of information.

And  then,  assuming  that  interest  is  held,  organisations  will  need  convincing 
information,  from  a  reliable  source,  of  the  advantages  and  disadvantages  of  alternative 
technologies. 

Should the decision be to proceed, there will be a need for back-up: information
on  supply  of  tools,  training,  and  so  on.  This  must  all  be  available  well  in  advance  of 
any  decision  though,  since  availability  and  supply  will  be  considered  during  the 
earlier  deliberations  of  the  organisation. 

Note  that  the  order  of  these  stages  is  not  necessarily  linear,  and,  in  particular, 
awareness  is  often  stimulated  by  gatekeepers. 


4  The  practice  -  mission  &  mechanisms 

4.1  The  mission  statement 

Having  studied  the  theory,  the  mission  statement  for  the  proposed  technology  transfer 
programme was established. This is:

“To achieve, in the supply of safety-related programmable electronic systems,
better informed application of safety engineering practices and better informed
choice and application of appropriate software technology.”

4.2  A  choice  of  mechanisms 

Having identified the different classes of information required by different people
in the diffusion of innovation, we drew up Table 1 as a first attempt to identify which
mechanisms provide support for which of the different aspects discussed previously.
It does not purport to be complete, and the ‘more-blobs-the-better’ ratings are
entirely subjective, as a starting point for discussion.

Note  that  some  of  these  mechanisms  perform  dual  roles.  For  instance,  collaborative 
research,  technology  demonstrators,  and  ‘application  experiments’  [13]  are  all  means 
whereby  technology  can  actually  be  transferred  into  participating  organisations.  But 
to  others  they  are  perceived  as  sources  of  information,  accessed  through  the  associated 
dissemination  activities. 
4.3 The Safety-Critical Systems Club

On  the  basis  of  all  these  considerations,  and  given  the  existence  of  the  collaborative 
research  programme,  we  decided  to  combine  the  functions  of  gatekeeper  clubs  and 
sector  clubs  in  a  ‘community  club’  -  the  Safety-Critical  Systems  Club. 

The  primary  aim  of  the  club  is  to  facilitate  the  flow  of  information  between 
practitioners  within  industry  -  both  between  peers  and  between  ‘leaders’  and 
‘followers’.  By  enhancing  awareness  of  current  practices  and  of  latest  developments, 
the  club  accelerates  agreement  on  what  constitutes  best  practice,  and  enables  evaluation 
of  academic  work. 
Since one of the aims is to achieve greater consensus among both individuals
and  organisations,  we  must  avoid  proliferation  of  additional  organisations,  and  of 
activities  which  overlap  or  compete  with  their  activities.  The  club  is  therefore 
intended  to  undertake  new  activities  only  where  necessary,  and  to  encourage,  stimulate, 
and  perhaps  facilitate  activities  of  existing  technology  transfer  organisations  who 
may  not  be  dealing  with  this  subject  at  present,  but  who  may  be  able  to  offer  the 
most  appropriate  forum.  Such  existing  organisations  include  trade  associations, 
sectoral  research  associations,  professional  institutions,  existing  technology  clubs, 
and  publishing  houses. 

Such a club is cheap, simple, informal, and, we believe, highly cost-effective.
At  the  last  count  there  were  over  1600  members  [14]. 

5  Effectiveness 

The  effectiveness  of  the  club  has  yet  to  be  proved.  It  is  a  requirement  that  all 
government supported initiatives of this type be evaluated. For evaluation we must
develop  some  ‘output  measures’  from  the  mission  statement.  This  has  yet  to  be 
done,  but  for  this  kind  of  technology  transfer  activity  we  will,  for  instance,  be 
seeking  evidence  that  organisations 

• are better informed about safety engineering and software technology

• review their activities and technological needs more thoroughly

•  are  better  able  to  judge  whether  changes  in  their  practice  are  desirable  and 

•  make  such  changes,  when  they  feel  they  are  appropriate. 
Dependability: from Concepts to Limits
LAAS-CNRS, Toulouse, France

Abstract

Our society is faced with an ever increasing dependence on computing
systems, which leads us to question ourselves about the limits of their
dependability. In order to respond to this question, a global conceptual
and terminological framework is needed, which is first given. The
analysis of the limits in dependability which is then conducted
identifies design faults as the major limiting factor, a consequence of
which is the concluding recommendation of applying a fault tolerance
approach to the improvement of the production process.


Introduction 

Our society has become increasingly dependent on computing systems and this
dependency is especially felt upon the occurrence of failures. Recent examples of
nation-wide computer-caused or -related failures are the 15 January 1990 telephone
outage in the USA, or the 26-27 June 1993 credit card denial of authorization in
France. The consequences of such events relate primarily to economics; however,
some outages can lead to endangering human lives as second order effects, or even
directly as in the London Ambulance Service failure of 26-27 November 1992. As a
consequence of such events, which can only be termed as disasters, the
consciousness of our vulnerability to computer failures is developing, as witnessed
by the following quotation from the report Computing the Future: A Broader
Agenda for Computer Science and Engineering [COM 92]: “Finally, computing has
resulted in costs to society as well as benefits. Amidst growing concerns in some
sectors of society with respect to issues such as unemployment, invasions of
privacy, and reliance on fallible computer systems, the computer is no longer seen
as an unalloyed positive force in society”.

Faced with this situation, a natural question is then “To which extent can we rely on
computers?”, or, more precisely, “What are the limits of computing systems
dependability?”. Responses to these questions need a conceptual and terminological
framework for dependability, which in turn is influenced by the analysis of the
limits in dependability. Such a framework can hardly be found in the many
standardization efforts: as a consequence of their specialization (telecommunications,
avionics, rail transportation, nuclear plant control, etc.), they usually do not
consider all possible sources of failures which can affect computing systems, nor do
they consider all attributes of dependability.

This work was partially supported by the ESPRIT Basic Research Action PDCS
(Predictably Dependable Computing Systems, project no. 6362).

The considerations expressed in the above two paragraphs have guided the contents
of the paper, which is composed of two sections. The first section is devoted to the
main definitions relating to the dependability concept; those definitions elaborate on
the definitions given in [Lap 92a]. The second section deals with the limits of
dependability.

1  The  Dependability  Concept 

1.1  Basic  definitions 

Dependability is defined as the trustworthiness of a computer system such that
reliance can justifiably be placed on the service it delivers. The service delivered by a
system is its behavior as it is perceptible by its user(s); a user is another system
(human or physical) which interacts with the former.

Depending on the application(s) intended for the system, different emphasis may be
put on different facets of dependability, i.e. dependability may be viewed according
to different, but complementary, properties, which enable the attributes of
dependability to be defined:

• the readiness for usage leads to availability,

• the continuity of service leads to reliability,

• the avoidance of catastrophic consequences on the environment leads to
safety,

• the avoidance of unauthorized disclosure of information leads to
confidentiality,

• the avoidance of improper alterations of information leads to integrity,

• the ability to undergo repairs and evolutions leads to maintainability.

A system failure occurs when the delivered service is not up to fulfilling the
system's function. An error is that part of the system state which is liable to lead
to subsequent failure: an error affecting the service is an indication that a failure
occurs or has occurred. The adjudged or hypothesized cause of an error is a fault.
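The chain from fault to error to failure can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the `Thermostat` class, its deliberately planted design fault, and the helper functions `has_error` and `has_failed` are hypothetical and do not come from the paper. The fault stays dormant for non-negative readings; a negative reading activates it and corrupts the stored state (an error); the failure occurs when that state reaches the user through the delivered service.

```python
class Thermostat:
    """Toy system (hypothetical): its function is to report the last
    temperature reading, in degrees Celsius, exactly as it was stored."""

    def __init__(self) -> None:
        self.stored_c = 0.0  # system state

    def store(self, reading_c: float) -> None:
        # FAULT (planted on purpose for illustration): the sign of the
        # reading is lost. The fault is dormant for readings >= 0.
        self.stored_c = abs(reading_c)

    def service(self) -> float:
        # The service is the behavior as perceived by the user.
        return self.stored_c


def has_error(system: Thermostat, true_reading_c: float) -> bool:
    # ERROR: part of the system state that is liable to lead to failure.
    return system.stored_c != true_reading_c


def has_failed(system: Thermostat, true_reading_c: float) -> bool:
    # FAILURE: the delivered service does not fulfil the system's function.
    return system.service() != true_reading_c


if __name__ == "__main__":
    t = Thermostat()
    t.store(21.5)                # fault present but dormant
    print(has_error(t, 21.5))    # state is still correct
    t.store(-5.0)                # fault activated: state is now wrong
    print(has_error(t, -5.0))
    print(has_failed(t, -5.0))   # the error reaches the service interface
```

In this toy the error becomes a failure as soon as the service is requested; in a larger system an error may remain latent in the state for a long time, or be removed by error recovery, before it ever affects the service.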

The  development  of  a  dependable  computing  system  calls  for  the  combined 
utilization  of  a  set  of  methods  and  techniques  which  can  be  classed  into: 

• fault prevention: how to prevent fault occurrence or introduction,

• fault tolerance: how to ensure a service up to fulfilling the system's
function in the presence of faults,

• fault removal: how to reduce the presence of faults,

• fault forecasting: how to estimate the present number, the future incidence,
and the consequences of faults.
The notions introduced up to now can be grouped into three classes and are
summarized by figure 1:

• the impairments to dependability: faults, errors, failures; they are undesired
- but not in principle unexpected - circumstances causing or resulting from
un-dependability (whose definition is very simply derived from the definition
of dependability: reliance cannot, or will not any longer, be placed on the
service);

• the means for dependability: fault prevention, fault tolerance, fault removal,
fault forecasting; these are the methods and techniques enabling one a) to
provide the ability to deliver a service on which reliance can be placed, and b)
to reach confidence in this ability;

• the attributes of dependability: availability, reliability, safety,
confidentiality, integrity, maintainability; these a) enable the properties which
are expected from the system to be expressed, and b) allow the system quality
resulting from the impairments and the means opposing them to be
assessed.


DEPENDABILITY
  ATTRIBUTES: AVAILABILITY, RELIABILITY, SAFETY, CONFIDENTIALITY, INTEGRITY,
    MAINTAINABILITY
  MEANS: FAULT PREVENTION, FAULT TOLERANCE, FAULT REMOVAL, FAULT FORECASTING
  IMPAIRMENTS: FAULTS, ERRORS, FAILURES

Figure  1  -  The  dependability  tree 


1.2  On  the  Attributes  of  Dependability 

The definition given for integrity, the avoidance of improper alterations of
information, generalizes the usual definitions (e.g., prevention of unauthorized
amendment or deletion of information [EEC 91], or ensuring approved alteration of
data [Jac 91]) which are directly related to a specific class of faults, i.e. intentional
faults, that is deliberately malevolent actions. Our definition encompasses accidental
faults as well (i.e. faults which appear or are created fortuitously), and the use of the
word information is aimed at not being restricted to data strictly speaking: integrity
of programs is also an essential concern; regarding accidental faults, error recovery is
indeed aimed at restoring the system's integrity. It is also noteworthy that our
definition can be interpreted by default in order to encompass subtle attacks against
integrity, such as preventing suitable data updates. Integrity is a prerequisite for
availability, reliability and safety, but it may not always be so for confidentiality, as
in the case of passive attacks using covert channels or wire-tapping.

Confidentiality, not security, has been introduced as a basic attribute of
dependability. Security is usually defined (see e.g., [EEC 91]) as the combination of
confidentiality, of integrity and of availability, where the latter are understood with
respect to unauthorized actions, whereas availability is not usually restricted to such
events, nor is integrity according to the above discussion; in addition, as noted in
[Gas 88], confidentiality is the most distinctive characteristic of security. A
definition of security gathering the three aspects of [EEC 91] is: the prevention of
unauthorized access and/or handling of information; security issues are indeed
dominated by intentional faults, but not restricted to them: an accidental (e.g.,
physical) fault can cause an unexpected leakage of information.

The definition given for maintainability, the ability to undergo repairs and evolutions,
deliberately goes beyond corrective maintenance, which relates to repairability
only. Evolvability clearly relates to the two other forms of maintenance, i.e. a)
adaptive maintenance, which adjusts the system to environmental changes, and b)
perfective maintenance, which improves the system's function by responding to
customer- and designer-defined changes. It is noteworthy that the frontier between
repairability and evolvability is not always clear, for instance if the requested change
is aimed at fixing a specification fault [Ghe 91]. Maintainability actually conditions
dependability when considering the whole operational life of a system: systems
which do not undergo adaptive or perfective maintenance are likely to be exceptions.

From their definitions, availability and reliability emphasize the avoidance of
failures, safety the avoidance of a specific class of failures (catastrophic failures), and
security the prevention of what can be viewed as a specific class of faults (the
prevention of the unauthorized access and/or handling of information). Reliability
and availability are thus closer to each other than they are to safety on one hand, and
to security on the other; reliability and availability can thus be grouped together
[Lap 92b, Jon 92], and be collectively defined via the property of avoiding or
minimizing service outages. However, this remark should not lead one to consider
that reliability and availability do not depend on the system environment: it has
long been recognized that a computing system's reliability/availability is highly
correlated to its utilization profile, be the failures due to physical faults or to design
faults (see e.g., [Iye 82]).

The properties enabling the definition of the dependability attributes may be more or
less emphasized depending on the application intended for the computer system
under consideration. For instance, availability is always required, although to a
varying degree depending on the application, whereas reliability, safety and
confidentiality may or may not be required according to the application. The
variations in the emphasis to be put on the attributes of dependability have a direct
influence on the appropriate balance of the means to be employed in order that the
resulting system be dependable. This is all the more difficult a problem as some of
the attributes are antagonistic (e.g., availability and safety, availability and
confidentiality), necessitating that tradeoffs be performed. Considering the three
main design dimensions of a computer system, i.e. cost, performance and
dependability, the problem is further exacerbated by the fact that the dependability
dimension is less mastered than the cost-performance design space [Sie 92].

The discussion conducted in this section is summarized in figure 2.


Figure 2 - Relationship between the dependability attributes


2  Limits  in  Dependability 

Dependable computing systems are specified, developed, operated and maintained
according to assumptions which are relative a) to the function(s) to be fulfilled, b) to
the environment where the computing system is to be operated (load, perturbations
from the physical environment, behavior of the operators and maintainers), and c) to
the faults which are likely to manifest, in terms of their modes and frequencies. The
achieved dependability crucially depends upon the validation a) of the actual
system with respect to these assumptions, b) of the assumptions themselves with
respect to reality, and, recursively, c) of the assumptions of the validation itself
(e.g., criteria according to which fault removal is conducted). Limiting factors to
dependability can thus originate from a variety of sources, due to inadequacies of the
development-operation assumptions, or to imperfections in the validation of the
system and of those assumptions.

Dealing in detail with all the above issues is clearly out of reach, all the more as
those issues are in fact closely related. In the sequel, we discuss the relationship
between function and failure, we investigate the effectiveness of fault tolerance with
respect to various types of faults, and we examine the distinction which has to be
made between the dependability which is actually achieved and the estimated
dependability.

2.1  Function  and  failure 

The function of a system is what the system is intended for [Kui 85]. The function
is usually first described, or specified, in terms of what should be fulfilled regarding
the system's primary aim(s) (e.g., performing transactions, controlling or
monitoring a plant, piloting an airplane or a rocket). When considering safety- or
security-related systems, this description needs to be supplemented with what should
not happen (e.g., the hazardous states from which a catastrophe may ensue, or the
disclosure of sensitive information). Such a description may in turn lead to
specifying additional functions that the system should fulfill in order to reduce the
likelihood of what should not happen (e.g., exhibiting a fail-safe behavior or
authenticating a user and checking his/her access rights).

The description of a system's function(s) can be performed at various levels of
detail, according to several means of expression, from natural language to
mathematics. A system may fail with respect to a given function, or with respect to
a given level of detail, and still comply with the others.

Expressing  a  system's  function(s)  is  an  activity  which  is  naturally  conducted  from 
the  very  first  steps  of  a  system's  development.  However,  we  all  know  that 
specifying  a  system's  functions  extends  to  the  whole  system's  life,  due  to  the 
inherent  difficulty  of  eliciting  a  system's  requirements.  If  the  largest  amount  of 
effort  devoted  to  this  elicitation  indeed  takes  place  at  the  beginning  of  the  system's 
development,  it  does  not  end  with  what  is  traditionally  called  "requirements  analysis 
and  specification"  in  life-cycle  models  such  as  the  waterfall  or  the  V  models;  this 
elicitation  practically  continues  during  all  phases  of  a  system's  development,  and 
during operational life as well: perfective maintenance is often performed in order to
correct  what  are  finally  specification  faults,  i.e.  oversights,  mistakes  or  omissions. 
This  remark  is  all  the  more  important  when  considering  failures:  an  unacceptable 
behavior  can  indeed  be  detected  as  such  from  its  deviation  from  complying  with  the 
specification,  but  it  may  happen  that  it  complies  with  the  specification  and  is  felt  as 
unacceptable  by  users,  thus  in  fact  uncovering  a  specification  fault;  leaving  aside  the 
question  of  the  consequences  of  such  a  situation,  it  can  be  said  that  in  fact  it  helps 
eliciting  what  the  real  function  of  the  system  ought  to  be.  This  type  of  problem 
should not be minimized, for instance in advocating that systems generally exhibit
failures due to the subsequent phases of their development more frequently than
failures  traceable  to  imperfections  in  the  definition  of  their  requirements  (see  e.g., 
[Gla  81]);  however,  besides  the  fact  that  severity  has  to  be  accounted  for  in  addition 
to  frequency,  this  frequency  argument  does  not  apply  to  safety-related  systems;  a 
consequence  of  the  extreme  care  put  into  their  design  and  realization  is  that 
specification imperfections generally constitute a significant source of the problems
faced in operation.

2.2  Effectiveness  of  fault  tolerance 

Before discussing the effectiveness of fault tolerance, let us stress that from our
definition (ensuring a service up to fulfilling the system's function in the presence
of faults), fault tolerance is a means for providing a system with the ability to behave
as expected, be it fail-safe or fail-operational.

The imperfections of fault tolerance, i.e. the lack of fault tolerance coverage,
constitute a severe limitation to the increase in dependability which can be obtained.
Such imperfections of fault tolerance are due either a) to design faults affecting the
fault tolerance mechanisms with respect to the fault assumptions made during the
design, the consequence of which is a lack of error and fault handling coverage, or b)
to fault assumptions which differ from the faults really occurring in operation,
resulting in a lack of fault assumption coverage, which can be in turn due to either
i) failed component(s) not behaving as assumed, that is a lack of failure mode
coverage, or ii) the occurrence of correlated failures, that is a lack of failure
independence coverage. The influence of a lack of error and fault handling coverage
[Bou 69, Arn 73] has been shown to be such that it not only drastically limits the
dependability improvement, but that in some cases adding further redundancies can
result in lowering dependability [Dug 89]. Similar effects can result from the lack of
the other forms of fault tolerance coverage: conservative failure mode assumptions
(e.g., inconsistent, or Byzantine failure modes) will result in a higher failure mode
coverage, at the expense of necessitating an increase in the redundancy which can
lead to an overall decrease in the system dependability [Pow 92]; correlated failures
are known to defeat fault tolerance strategies based on the replication of identical
items, either because of a) common design faults, or of b) identical sensitivity to
workload or externally-induced perturbations [Hec 87].
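
The effect of imperfect error and fault handling coverage can be illustrated with a deliberately simple model, which is an assumption made here for illustration and is not taken from the cited references: n identical active units, each surviving the mission independently with probability r, each unit failure handled correctly with probability c, and any uncovered failure bringing the whole system down. The function name and parameters below are hypothetical.

```python
from math import comb


def parallel_reliability(n: int, r: float, c: float) -> float:
    """Reliability of n active redundant units with imperfect coverage.

    Assumed model (illustrative only): each unit survives with probability r,
    independently; each unit failure is covered with probability c; a single
    uncovered failure causes system failure; at least one unit must survive.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    q = 1.0 - r
    # Sum over k covered failures among n units, leaving n - k >= 1 survivors.
    return sum(comb(n, k) * (q * c) ** k * r ** (n - k) for k in range(n))


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, round(parallel_reliability(n, r=0.9, c=0.95), 6))
```

Under these assumptions, with r = 0.9 and c = 0.95, reliability rises up to three units and then falls as further units are added, because each extra unit adds another opportunity for an uncovered failure. With c = 1 the model reduces to the usual 1 - (1 - r)^n.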

Temporary faults are another significant limit to dependability. They have
long been recognized as constituting the vast majority of hardware faults, and
progress in hardware integration can only emphasize this tendency (see e.g., the
field data from several sources in [Sie 92]). A direct consequence is that, although it
is not always so [Geb 88], emphasis should be placed in the design of fault-tolerant
systems on discriminating between temporary and permanent faults: the
misinterpretation of a temporary fault as a permanent fault results in an unnecessary
decrease in the available redundancies, thus in lowering dependability.
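
One simple way to discriminate between temporary and permanent faults is to re-exercise a unit after an error is detected and to retire it only if the error persists. The sketch below assumes a hypothetical `check_unit` callback and a fixed retry policy; neither comes from the text, and real systems typically use richer diagnosis.

```python
from typing import Callable


def diagnose(check_unit: Callable[[], bool], max_retries: int = 3) -> str:
    """Classify the fault behind an error just detected on a unit.

    check_unit() re-exercises the unit and returns True if it now behaves
    correctly. Hypothetical policy: the fault is declared permanent only
    if every one of max_retries re-checks fails.
    """
    for _ in range(max_retries):
        if check_unit():
            return "temporary"  # unit recovered: keep it in service
    return "permanent"  # error persisted: retire the unit
```

A larger `max_retries` reduces the risk of discarding a healthy unit, at the price of a longer diagnosis delay when the fault really is permanent.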

Temporary faults are not limited to hardware: the notion of temporary fault applies
to software as well. Although such a notion was introduced a long time ago
[Elm 72], and more recent studies have shown that most of the software faults
present during operational life are temporary faults [Gra 86], the very notion of
temporary software fault is often felt as contradicting our perception of software. In
fact, if it is not arguable that the ultimate causes of software faults are present as
long as they are not fixed, it has to be recognized that most software faults
manifesting in operation in large, complex software are subtle enough that
their activation conditions depend on equally subtle combinations of internal state
and external solicitation, so that they can hardly be reproduced. Stated in other
terms, the failure domain of temporary software faults can vary with the conditions
of execution of the software, and be a null space under most operating conditions.
Although it is generally recognized that software is the current bottleneck in terms
of dependability [Gra 90, Cra 92], fault tolerance approaches aimed at temporary
software faults have been paid little attention [Gra 86, Hua 93] when compared to
the work devoted to tolerating permanent software faults, which necessitates design
diversity [Ran 75, Che 78, Lap 90]. Because of the high cost of design diversity,
software fault tolerance is currently mostly limited to some safety-critical
applications [Vog 88] such as avionics, and, to a lesser extent, railway
transportation or nuclear plant monitoring. The main limiting factor of the
undeniable improvement in dependability brought about by design diversity is
constituted by the unavoidable correlations between the software variants issued
from diversified designs [Eck 91], which are a form of lack of failure independence
coverage.

This brief overview of the factors limiting dependability in terms of fault classes
would not be complete without mentioning the man-machine interaction faults,
resulting from inappropriate interactions. Their importance is not new, and is not
restricted to safety-critical systems [Toy 78, Dav 81], although they are generally
felt as a major source of failure in those systems, for the same reason as already
mentioned regarding specification faults, i.e. the extreme care put into their design
and realization. Recognizing that most interaction faults are in fact design faults
of the system [Nor 83] should encourage the development of approaches for their
tolerance [Max 86, Rou 87].

Fault tolerance is not restricted to accidental faults, which have been dealt with so far in
this section. Some mechanisms of error detection are directed towards both
intentional and accidental faults (e.g. memory access protection techniques) and
schemes have been proposed for the tolerance to both intrusions (i.e. intentional
interaction faults) and physical faults [Rab 89, Des 91], as well as for tolerance to
malicious logic (i.e. intentional design faults) [Jos 88]. However, fault tolerance is
far from being recognized as a viable approach to security issues, in spite of the fact
that the continuously increasing cost of failures attributable to intentional faults
clearly shows that the current approaches, mainly based on fault avoidance (i.e. fault
prevention and fault removal), do not constitute a fully satisfactory answer.

2.3  Achieved  dependability  and  estimated  dependability 

It is now widely admitted that performing dependability evaluation of
hardware-fault-tolerant systems without accounting for the lack of fault tolerance
coverage can only lead to grossly overestimated evaluations of dependability.
However, it is noteworthy that while error and fault handling coverage has received
considerable attention due to the ability to estimate it via fault injection (see e.g.
[Gun 89, Arl 90, Cho 92]), this is not so for assumption coverage, as it needs
experience to be accounted for in addition to fault injection.
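
As an illustration of how a fault-injection campaign yields a coverage estimate, the sketch below computes the fraction of injected faults that were correctly handled, together with a Wilson score interval. The choice of the Wilson interval, the function name and the sample figures are assumptions made here; the cited studies may use other estimators, and the result is only meaningful if the injected faults are representative of those occurring in operation.

```python
from math import sqrt


def coverage_estimate(handled: int, injected: int, z: float = 1.96):
    """Return (point estimate, lower bound, upper bound) for coverage.

    handled:  number of injected faults correctly detected and handled
    injected: total number of injected faults
    z:        normal quantile (1.96 corresponds to a two-sided 95% interval)
    """
    if injected <= 0 or not 0 <= handled <= injected:
        raise ValueError("need 0 <= handled <= injected and injected > 0")
    p = handled / injected
    denom = 1 + z * z / injected
    centre = (p + z * z / (2 * injected)) / denom
    half = z * sqrt(p * (1 - p) / injected + z * z / (4 * injected ** 2)) / denom
    return p, centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical campaign: 970 of 1000 injected faults handled.
    print(coverage_estimate(970, 1000))
```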

Regarding software reliability modeling and prediction, the vast majority of the
published material focuses on the development and validation phases. Far fewer
studies encompassing the operational phase appear in the open literature. As already
mentioned in the previous section, a clear tendency in computing systems is to see
software becoming the major source of failure; there is thus a growing need for
performing, during their validation, reliability evaluations of software systems in
order to forecast their (future) reliability in operation.

The current approaches to software reliability evaluation cannot satisfy this need.
The main reason lies in the fact that they consider products in isolation from the
process which produced them: the reliability predictions performed for a given
software either a) from its successive times to failure for reliability growth
modeling, or b) from its execution times without failures for statistical testing, can
only be far below any reasonable reliability requirement (see e.g. [Par 90, The 91]
for statistical testing). In addition, those predictions usually (and hopefully) are also
far below what will be observed in operation (see e.g. [Kan 87, Gra 90]).
Considering that software systems produced from scratch are exceptions, and that the
current usual approach is rather to make evolutions from existing software, a logical
consequence is to enhance the predictions performed for a given product from data
relating to its validation with field data relative to previous, similar software
products [Lap 92c]. The cornerstone of the approach is clearly the notion of
similarity between the various generations of a software family; however, in
addition to the usually mentioned negative dissimilarities resulting from added
failure sources, we can expect positive dissimilarities to exist, resulting from
progress in the development and validation methods and techniques.
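
The limitation of statistical testing noted above can be made concrete with a standard argument, given here as a sketch under stated assumptions rather than as the method of the cited references: if demands are independent and drawn from the operational profile, observing N failure-free demands supports the claim "probability of failure per demand at most p" with confidence C only when (1 - p)^N <= 1 - C. The function name is hypothetical.

```python
from math import ceil, log, log1p


def failure_free_demands(pfd_target: float, confidence: float) -> int:
    """Smallest N such that (1 - pfd_target)**N <= 1 - confidence.

    Assumes independent demands with a constant failure probability and
    no failure observed during the whole test campaign.
    """
    if not (0.0 < pfd_target < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("pfd_target and confidence must lie in (0, 1)")
    # log1p(-p) computes log(1 - p) accurately for very small p.
    return ceil(log(1.0 - confidence) / log1p(-pfd_target))


if __name__ == "__main__":
    for target in (1e-3, 1e-4, 1e-6, 1e-9):
        print(target, failure_free_demands(target, confidence=0.99))
```

Under these assumptions a target of 1e-4 at 99% confidence already requires about 46,000 failure-free demands, and the required number grows roughly in inverse proportion to the target, which is why very low failure probabilities cannot be demonstrated by testing alone.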

3  Conclusion 

The  discussion  conducted  in  the  previous  section  clearly  shows  that  design  faults 
constitute  the  major  limitation  to  dependability  of  computing  systems,  be  they 
fault-tolerant or not. This is not at all surprising: a computing system is a human
artifact and as such any fault in it or affecting it is ultimately human-made since it
represents  human  inability  to  master  all  the  phenomena  which  govern  the  behavior 
of  a  system.  As  a  direct  consequence,  pushing  forward  the  limits  of  dependability  of 
computing  systems  can  only  be  done  via  improving  their  production  process. 

Improvements in the production process can indeed be noticed, when looking at the
evolutions between the successive editions of guidelines for the development and
validation of computing systems (a noticeably good example is the recent new
release of the guidelines for certifying airborne software, DO-178-B, as compared to
the previous release, DO-178-A). This is however a feedback loop which takes a
considerable amount of time in order to profit from experience. Some shorter
feedback loops are currently being explored, such as the previously mentioned work
we are carrying out on software reliability evaluation [Lap 92c], or the recently
published work on employing accumulated knowledge in software testing [Wil 92]. Both
examples relate to validation: they are aimed at improving fault removal and fault
forecasting, but they are not directly and explicitly concerned with the progressive
reduction of the faults created during the development and inserted in the developed
system before fault removal takes place. This is where the most effective feedback
loop should exist, which would be nothing else than applying to the production
process a fault tolerance approach: detecting the errors in the development,
diagnosing and removing the causes, i.e. the faults in the production process.

We have to be conscious that such an approach questions directly the very
organization of the producers: carrying out, and profiting from, the necessary
experience does not appear compatible with the project-oriented organizations which
currently prevail. It can only be hoped that the statistics on the continuously
growing costs of computer failures can be a strong enough incentive to overcome
organizational inertia. Let us just mention that the yearly computer failure cost
now exceeds 10 billion francs in France, both accidental and
intentional faults included (and has been constantly larger than the profit of the
whole computing industry in France, including construction, distribution and
services), and is close to 4 billion dollars in the USA for accidental faults.

1.  Introduction

Sizewell 'B' is a Westinghouse designed Nuclear Pressurised Water Reactor (PWR)
currently being built in Sizewell, Suffolk in the UK. It possesses two diverse
protection systems whose role is to provide an automatic reactor trip when plant
conditions reach safety limits and to actuate emergency safeguard features to limit
the consequences of a failure condition.

The Primary Protection System (PPS) is a microprocessor based system developed
by Westinghouse in general agreement with IEC 880 [1]. The PPS is supported by
a non-computer based Secondary Protection System (SPS), based on Laddic
technology, developed in the UK by GEC. However, despite the presence of the
SPS, the software within the PPS is considered to be 'safety critical' since the PPS
on its own is required to meet an integrity requirement of 10^ failures per demand.

The UK Nuclear Installations Inspectorate (NII), the regulatory body responsible
for certificating the station, have two main criteria for software based safety
systems, namely excellence of production and independent assessment. The rigour
of each of these is required to be commensurate with the level of criticality and
safety dependence of the software.

As a result of this requirement for independent assessment, Nuclear Electric, the
utility responsible for Sizewell 'B', have had a range of assessments conducted
independently of the equipment manufacturer. One of the main activities
is the use of the static analysis tool MALPAS to rigorously verify the software
against its specifications.

2.  Scope  of  Independent  Assessment  Activities 

The overall objectives of any rigorous independent software assessment are to
demonstrate, to as high a level as is practicable, that the object code in the PROMs
implements the user requirements, correctly, completely, safely and without side
effects. However, with current software technology, no single activity is able to
demonstrate this to the required level of confidence. Hence a range of activities
are necessary, each providing assurance of individual steps in the software
development life cycle such that, when taken together, the independent assessment
activities provide the required level of assurance that the last step in the process
meets the first.

On the Sizewell 'B' PPS five main activities are being conducted, as follows:

• Engineering Confirmatory Analysis conducted by NNC Ltd, involves the
manual review of all specifications, from the system design requirements down
to the low level code specifications, and the review of all source code and data,
to ensure overall consistency and progressive implementation of the
requirements.

• Independent Design Assessment conducted by Nuclear Electric's own team,
ensures that all essential system functional requirements, including design
principles derived from AGR reactor protection system experience, are correctly
incorporated into the System Design Specification and down through the
different stages of the software design.

• MALPAS Analysis conducted by TA Consultancy Services Ltd, involves the
formal verification of the source code against its specifications and is discussed
in detail throughout the rest of this paper.

• Object/Source Code Comparison conducted by Nuclear Electric's own
Independent Design Assessment team and discussed in detail in reference [2],
aims to eliminate the possibility of errors being introduced by the compiler and
linker, by formally demonstrating (again using MALPAS) equivalence between
object code in the PROMs and the source code from which it was generated.

• Dynamic Testing conducted by Rolls Royce and Associates Ltd, involves the
conduct of an extensive series of tests (approximately 55000 randomly generated
test cases) on one of the four identical channels of the PPS.

In total these activities are expected to have involved around 250 man years of
effort, an amount equivalent to that spent by the software manufacturer in their own
development and verification work, by the time that the software is certificated at
the end of 1993. Although high, this level of effort is considered necessary, one
reason being that the PPS software design pre-dates the practical application of
the latest formal software design methods. Consequently the secondary aim of the
static analysis of source code has been to impose formalism on the overall
development process.

The main aim of the MALPAS analysis of the software is to verify, as formally
as possible, that the Sizewell 'B' software (source code) meets its specifications.
This verification encompasses both the manual comparison of the analysis results
against the design specifications and also the 'proof' of code against a mathematical
representation of the detailed design specifications. However, it is important to
appreciate from the above that, whilst the MALPAS analysis is the single largest
activity, it is still one of a number of independent assessment activities which, taken
together, are intended to provide comprehensive coverage and maximise confidence
in the safety of the software.

3.  Background  to  MALPAS  Analysis 

In 1988, following consideration of the activities necessary to demonstrate
correctness of the PPS software, Nuclear Electric concluded that analysis methods
were required which could demonstrate conformance with specifications and give
high levels of confidence in freedom from errors. It was considered that dynamic
testing on its own would not be able to give the required level of confidence
because of known limitations of testing, such as the inability to achieve complete
path coverage on any non-trivial piece of software. These same concerns had been
addressed a few years previously by the UK Ministry of Defence and it was
therefore decided by Nuclear Electric to follow the same approach adopted by the
MoD, namely to use techniques known as Static Analysis.

The tool chosen by Nuclear Electric for the work was MALPAS [3] which itself
originated from the UK MoD research in the 1970s and 1980s. MALPAS was
chosen because it provided the level of formal analysis considered essential and was
considered to be the tool most suited to retrospective analysis of code that has not
been developed with formal verification in mind. The tool consists of three main
sets of analysers, namely the Flow Analysers, the Semantic Analyser and the
Compliance Analyser, each of which is able to provide a progressively more
rigorous verification of the code against its specifications. It was Nuclear Electric's
decision, right from the beginning, that all the MALPAS analysers should be used
on the PPS software, with the emphasis being on the Compliance Analyser.

4.  PPS  Software  Description 

The  Primary  Protection  System  consists  of  four  identical  guardlines  (channels), 
physically  and  electrically  separated  from  each  other,  which  perform  coincidence 
(2  out  of  4)  voting  to  determine  the  need  for  action.  Each  guardline  contains  a 
number  of  sub-systems,  providing  facilities  such  as  reactor  trip,  communications, 
engineering safety features, and auto test. Each subsystem comprises a general
purpose  ’host’  processor  and  a  number  of  slave  processors.  The  host  processor 
performs  the  unique  functions  required  by  the  subsystem  and  is  supported  in  this 
by  the  specialised  slave  processors  which  provide  standard  functions  such  as 
communications,  analogue  data  acquisition  and  diagnostic  monitoring. 
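
The coincidence voting described above can be illustrated with a minimal Python sketch. The function name and the representation of each guardline's demand as a boolean are assumptions made for this example only; they are not taken from the PPS design.

```python
from typing import Sequence


def two_out_of_four(trip_demands: Sequence[bool]) -> bool:
    """Return True when at least two of the four guardlines demand action."""
    if len(trip_demands) != 4:
        raise ValueError("expected exactly four guardline demands")
    # Count the guardlines that are currently demanding action.
    return sum(bool(d) for d in trip_demands) >= 2
```

With this rule a single spurious demand does not cause action, while any two agreeing guardlines do.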

The PPS software is primarily written in PL/M-86 with some ASM86 and small
amounts of PL/M-51 and ASM51 variant code. In total there are approximately
100,000 lines of unique executable code with a typical host processor containing
40,000 lines of source code and a typical slave processor 10,000 lines. The
software is highly modular and has been written mostly as ’general purpose’
reusable software providing a range of common functions, configured through the
use of configuration data. There is a relatively small amount of applications level
code providing the particular functionality of the protection system.

The software is developed from two levels of specifications, namely a Software
Design Requirements (SDR) which is a high level specification of the required
functionality of each section of the code, and a Software Design Specification
(SDS) which describes how the functional requirements are met by the design.
Both of these sets of specifications are written in ’natural language’ (American
English). The SDS includes details of the precise functionality of each program
section, and also contains data flow tables providing details of all program section
variables and inputs and outputs.

5. Overview of MALPAS Analysis Process

The analysis of the PPS software is conducted on a procedure-by-procedure basis
in a bottom-up manner. That is to say that the analysis commences with those
procedures that call no others and then progresses up the call hierarchy until the top
level applications code is reached. All the MALPAS analysers are run on each
procedure and the results verified against the two levels of specification. The
higher level specification (SDR) is taken to be the primary document against which
the code is verified, with supporting information provided by the lower level SDS.
All of the code that can be accessed during on-line operation is being subjected to
MALPAS analysis, irrespective of the perceived criticality of individual sections
of code within the system.
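
The bottom-up ordering described above amounts to visiting the call graph so that every callee is analysed before its callers. The following Python sketch shows one way to compute such an order; the call hierarchy and procedure names are hypothetical, and the sketch says nothing about how the ordering was actually managed on the project.

```python
from typing import Dict, List


def bottom_up_order(calls: Dict[str, List[str]]) -> List[str]:
    """Order procedures so that every callee precedes its callers.

    `calls` maps each procedure name to the procedures it calls.
    A recursive call chain is rejected, because no strict bottom-up
    order exists in that case.
    """
    order: List[str] = []
    state: Dict[str, str] = {}  # procedure -> "visiting" or "done"

    def visit(proc: str) -> None:
        if state.get(proc) == "done":
            return
        if state.get(proc) == "visiting":
            raise ValueError(f"recursive call chain through {proc}")
        state[proc] = "visiting"
        for callee in calls.get(proc, []):
            visit(callee)
        state[proc] = "done"
        order.append(proc)  # appended only after all of its callees

    for proc in calls:
        visit(proc)
    return order


# Hypothetical call hierarchy: the application code calls two lower level procedures.
example = {"app_main": ["read_input", "vote"], "vote": ["read_input"], "read_input": []}
print(bottom_up_order(example))  # ['read_input', 'vote', 'app_main']
```

Procedures that call no others appear first, and the top level applications code appears last.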

The analysis process has not changed in concept since the work began at the start
of 1989. However, the detail of how the work is conducted has changed
substantially both as experience has been gained over the years and with the
progressive increase in the size of the analysis team. The following sections
discuss various aspects of the analysis process.

6.  Translation 

It is necessary for source code to be translated into MALPAS’s own input
language, known as IL, before analysis can take place. IL is a strongly typed
language with many features similar to those of Ada. One feature in particular,
similar to Ada packages, is that all IL procedures comprise separate bodies and
specifications (known as PROCSPECs). The advantage of this is that, following
the analysis of the procedure body, all that is required for analysis of calls to that
procedure is the procedure specification. This feature greatly facilitates bottom-up
analysis.

As has been mentioned above, the majority of the code in the PPS is written in
PL/M-86 and an automatic translator to convert PL/M-86 into IL was developed
by TA Consultancy Services specifically for this project. Considerable care was
taken in the derivation of the mappings between PL/M-86 and IL to ensure that
they were a strictly correct representation of the semantics of the PL/M-86
language and that they were at an appropriate level of abstraction/precision to
facilitate analysis. The translator was also designed to produce a number of
checks, for example for exceeding array bounds or loop counter overflow, that
could be formally checked during Semantic and Compliance Analysis.

The translator takes an input PL/M-86 source text, typically a module, along with
other relevant inputs, such as PROCSPECs from previously analysed procedures,
and automatically produces an IL translation. Error and warning messages are
given where language features are encountered that the translator is unable to
convert to IL automatically. The analyst then has to take appropriate action,
checking the accuracy of the particular part of the translation or making suitable
manual changes.

Pointers are one PL/M-86 language feature that can cause such problems, firstly,
because there is no concept of pointers within IL and, secondly, because they
represent a source of aliasing which contradicts IL’s philosophy of all variables
being distinct and disjoint with all assignments having no side effects.
Unfortunately PL/M-86 is a pointer-based language with the use of pointers being
essential for a number of operations. The most significant of these concerns the
passing of OUT or INOUT parameters to procedures which must be performed by
reference (ie pointers must be used to pass any item to a procedure that may be
modified by that procedure).

The solution adopted for the analysis is for all pointers to be ’dereferenced’ (ie
the pointer replaced by the item pointed at) during the translation process.
Fortunately the use of pointers is fairly well constrained in the PPS software
(through the application of Westinghouse’s coding standards) and most pointers are
used only at the procedure call stage and point to a single variable/memory
location. The translator is therefore able to automatically dereference the majority
of pointers used within the code. Those that the translator is unable to
automatically dereference (for example pointers to templates) are brought to the
attention of the analyst through error and warning messages and the analyst then
has to implement an agreed (and subsequently reviewed) manual translation.

7.  Analysis  Process 

Every  program  section/procedure  in  the  source  code  is  subjected  to  the  three  main 
stages  of  MALPAS  analysis  in  sequence,  with  the  following  particular  activities 
being conducted under each one.

7.1  Flow  Analysis 

This involves the analysis of the flow of control, data and information through each
procedure,  in  particular: 

• Control Flow Analysis: The verification of safe control flow through each
procedure (eg ensuring the absence of multi-entrant loops and black holes) and
the confirmation that the code is well structured.

• Data Use Analysis: The analysis of the use of all parameters and variables within
each procedure to ensure that this use agrees with that detailed in the code
specifications (SDS) and to ensure that the usage is safe and in accordance with
general software engineering good practice rules.

• Information Flow Analysis: The analysis of all dependencies between input and
output parameters for each procedure to ensure that these agree with the code
specifications (SDR/SDS), and the identification of redundant statements.

7.2  Semantic  Analysis 

This analysis involves the confirmation both that the functionality of the code
conforms with the specifications (SDR and SDS) and is reasonable from an
engineering knowledge of the system. The MALPAS Semantic Analyser converts
the (potentially confusing) sequential logic of the code of each procedure into a
clearer parallel form, giving a precise mathematical relationship between inputs and
outputs for each path through that procedure. The analyst is then able to check the
detailed semantics of every path through the program section and manually compare
this against the specification to verify one against the other.

7.3 Compliance Analysis

The purpose of the Compliance Analysis is to obtain the highest level of confidence
in the correctness of the software by formally verifying the code against a
mathematical specification. Specific objectives are to demonstrate that the code of
each procedure:

• conforms to the functional aspects of its specification (SDR, supported by SDS),

• respects any state invariants,

• performs its specified functions without corrupting the computing environment
and without corrupting data owned by any other module or procedure,

• conforms to the static semantics of the language in which it is written.

The MALPAS Compliance Analyser requires the specification to be represented
as PRE and POST conditions for each procedure, along with any necessary
ASSERT statements within the body of the code, and the analyser will then show
whether the code meets the specification. The analyser identifies any differences
between the code and specifications as a ’Threat’, expressing this as a mathematical
expression which details the domain over which the code and the specification
disagree. When the code and specification agree the tool shows that the Threat is
false.
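
The notion of a Threat as the domain over which code and specification disagree can be illustrated on a small finite domain. MALPAS derives the Threat symbolically; the Python sketch below substitutes brute-force enumeration purely for illustration, and the clamping procedure together with its PRE and POST conditions is hypothetical.

```python
from typing import Callable, Iterable, List


def threat_domain(code: Callable[[int], int],
                  pre: Callable[[int], bool],
                  post: Callable[[int, int], bool],
                  domain: Iterable[int]) -> List[int]:
    """Return the inputs that satisfy PRE but whose result violates POST."""
    return [x for x in domain if pre(x) and not post(x, code(x))]


# Hypothetical procedure: limit a reading to the range 0..100.
def clamp_buggy(x: int) -> int:
    return 100 if x > 100 else x  # omits the lower bound check


def clamp_fixed(x: int) -> int:
    return max(0, min(100, x))


def pre(x: int) -> bool:
    return -1000 <= x <= 1000          # PRE: input within sensor range


def post(x: int, y: int) -> bool:
    return 0 <= y <= 100               # POST: output within 0..100


print(threat_domain(clamp_buggy, pre, post, range(-3, 4)))  # [-3, -2, -1]
print(threat_domain(clamp_fixed, pre, post, range(-3, 4)))  # []
```

An empty result plays the role of "Threat is false" over the enumerated domain; unlike the symbolic analysis, it says nothing about inputs outside that domain.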

The Compliance Analysis is by far the most important part of the MALPAS
analysis of the PPS software and also involves the majority of the effort. This
effort is expended partly in the derivation of the mathematical specification and
partly in the provision of guidance to the tool to aid simplification of the Threat.
The analyst is also required to derive invariants for each loop, expressing these as
ASSERT statements, so that properties of each loop can be proven.
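
As a rough analogy for how a loop invariant is attached to code alongside PRE and POST conditions, the following Python sketch uses runtime assertions. This is only an analogy: the Compliance Analyser proves such conditions statically for all inputs, whereas a Python `assert` checks only the cases actually executed. The procedure and its conditions are hypothetical.

```python
from typing import List


def sum_readings(readings: List[int]) -> int:
    """Sum a list, with PRE, loop invariant and POST written as assertions."""
    assert all(r >= 0 for r in readings)   # PRE: readings are non-negative
    total = 0
    i = 0
    while i < len(readings):
        # ASSERT (loop invariant): total is the sum of the first i readings
        assert 0 <= i <= len(readings) and total == sum(readings[:i])
        total += readings[i]
        i += 1
    assert total == sum(readings)          # POST: total is the sum of all readings
    return total
```

The invariant holds on entry (i = 0, total = 0), is preserved by each iteration, and together with the exit condition i = len(readings) implies the POST condition.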

The first part of the Compliance Analysis work is the construction of the
mathematical specification from the natural language SDR and SDS. This work
represents a substantial challenge to ensure both correct interpretation of the
existing specifications and that the important functionality and properties are
modelled in the mathematical specifications.

Ideally the analyst should be able to define a high level abstract mathematical
specification from the SDR and then derive refinement detail from the SDS. For
example, a functional relationship could be defined between an output and
appropriate inputs, using information from the SDR, and detailed semantics can
then be defined (using an IL feature known as replacement rules - similar to OBJ
rewrite rules) from information in the SDS. In practice, due to the SDRs and
SDSs being at varying levels of detail and precision, the ideal is rarely possible and
analyst skill and judgement are required to derive the mathematical specification
from both specification documents.

The second, and most time consuming, part of the analysis is the use of
techniques to assist the Compliance Analyser in its demonstration of whether the
Threat is false. MALPAS contains a powerful Algebraic Simplifier for simplifying
expressions but, like all such tools, it has its limitations, particularly with the
simplification of some forms of expressions. Analyst assistance may involve the
re-writing of mathematical specifications into forms more amenable to
simplification, the use of replacement rules to define additional theorems or the use
of extra ASSERT statements. The derivation of a ’Threat False’ statement is
therefore an iterative process, involving running the tool, assessing the Threat
condition and then making appropriate changes to reduce the expression on a
subsequent run of the tool.

8.  Ensuring  Correctness,  Consistency  and 
Reproducibility 

Despite using an automatic tool for the rigorous static analysis, it can be
appreciated from the above description that substantial analyst skill and judgement
are involved in interpreting the tool’s output and in deriving the mathematical
specifications. Because of this it is essential, given the size of the project, to
ensure that all analysts are performing the work to the same standard. It is also
considered important to ensure that all analysis is reproducible and could be re-run,
for example if a query was ever raised by the regulatory authority on a particular
piece of analysis.

Five main measures have been adopted by TA Consultancy Services to ensure that
these requirements are met. The first, perhaps obvious one, is that all work is
conducted within a defined quality system and to the requirements of BS5750
(ISO9001). The second, related, measure has been the derivation of a detailed
’standards and procedures’ document, approximately 200 pages long, which defines
in great detail precisely how the analysis is to be conducted. This document,
analogous to a coding code of practice for software development, is followed by
all analysts and ensures uniformity of approach.

The third measure is to ensure comprehensive recording of all aspects of analysis.
In addition to the MALPAS analysis results themselves, analysts are required to fill
out a series of forms to summarise the results and to record their interpretation of
the results, the background to their derivation of the mathematical specification,
their assessment of the conformance of the code and specifications, and details of
all anomalies.

The fourth, and possibly most important measure, is the conduct of peer reviews
of all analysis work. The main aims of these reviews are to ensure that the defined
processes have been followed for the analysis and to ensure that Compliance proofs
are well founded, that loop termination has been demonstrated and that replacement
rules are correct. In addition they help to distribute knowledge of different parts
of the PPS software and experience of the range of analysis techniques and to
generally ensure consistency between analysts. The reviews are conducted on a
controlled set of results, using a series of checklists and review deficiency forms.

The final measure is the enforcement of strict configuration control of all
documents and analysis results, both in paper and magnetic media form. In terms
of computer (VAX)-based configuration control a rigidly defined manual system is
used involving controlled and frozen directories with a named librarian being the
only person authorised to move files from one status to another.

9. Technique Refinement

The analysis techniques used on the project have been continuously refined and
improved during the 4½ years of the project. This has covered everything from
the forms on which the interpretation of the results is recorded (despite
streamlining, there are still perpetual complaints about too much paperwork) to the
format and depth of reviews. However, changes in the techniques used are
necessarily slow to be implemented, firstly to ensure compatibility with the analysis
of lower level procedures, and, secondly, because of the inertia and diverse views
of such a large team.

The area in which there has undoubtedly been the most improvement is
Compliance Analysis. Substantial experience has been gained in the most effective
ways to express mathematical specifications in order to be able to demonstrate
conformance with the code. Numerous technical papers have been written
internally to provide a series of hints and tips to all analysts on how to resolve
particular problems that arise during Compliance Analysis, either in the expression
of the mathematical specification, or in the resolution of the Threat expression to
false.

It has also been of substantial benefit for TA Consultancy Services to themselves
be the developers and suppliers of MALPAS, as well as the users of the tool on
this project. Analysts have had access to the implementors of the tool and have
been able to obtain expert advice on the best way to represent specifications in
order to facilitate expression simplification. Furthermore, experience from the
analysis project has been fed into the development of the algebraic simplifier to
improve its performance with specific types of expressions, to the extent that speed
improvements in excess of two orders of magnitude have been made to the
Compliance Analyser in some areas.

One interesting aspect is that new techniques continue to be required even after
4½ years. This is primarily because different areas of code are encountered that
present a new series of problems. One area that required considerable effort in the
past was the analysis of ASM86 code involving double-length arithmetic. Another
area where a whole range of new techniques has been necessary concerns the
analysis of the shared memory communication system used on the PPS. It is
possible that this could be the subject of a paper on its own at a later date.

10.  Analysis  Limitations 

It is important to recognise that the technique being performed is Static Analysis
which, by definition, is concerned with the non-dynamic aspects of the software.
Although many dynamic and timing related aspects of the software are modelled
and analysed in detail during the MALPAS work, others are not if it is considered
more appropriate for them to be verified by other means. Where such cases arise,
notification is given to Nuclear Electric that the specific aspects have not been
verified statically and that other means are required.

The aspects not verified statically relate primarily to those concerned with real
time operation and with interaction with hardware. For example the checking of
a communications protocol with requirements of ’waits’ of specified lengths of time
may be more appropriate through dynamic testing on the target hardware. This is
partly because testing of all paths through such procedures is likely to be possible
but, more importantly, because dynamic testing will effectively also provide
validation of the specification and will reveal whether the design actually works in
practice.

11.  Reporting  and  Resolving  Results 

The only deliverables to the customer (Nuclear Electric) resulting from the
MALPAS analysis are comments detailing the anomalies (ie differences between
code and specifications etc) found during the work. All comments are provisionally
categorised according to their criticality by TA Consultancy Services prior to the
comments being reported to Nuclear Electric. The following five categories are
used for this reporting process:

Cat 1: Essential code change to address potential maloperation of the PPS.

Cat 2: Specification or requirements change.

Cat 3: Code change to remove non-critical anomalies or to address necessary
improvements.

Cat 4: Comments for which no action is required.

Cat U: Interim categorisation to be resolved at sentencing.

This last category is intended to cover comments where the analyst has
insufficient information regarding the rest of the system’s operation to be able to
determine the severity of the anomaly.

Following the reporting of provisionally categorised comments a sentencing
process is undergone between TA Consultancy Services, Nuclear Electric and
Westinghouse (often also held with the NII in attendance). During the sentencing
process each comment is discussed and resolved and given a mutually agreed final
categorisation (using categories 1-4 above) which determines the corrective action,
if any.

12.  Project  Status  and  Results  to  Date 

The MALPAS analysis work started in January 1989 and is scheduled to complete
at the end of October 1993. Because of the size of the analysis task and because
the ’production’ version of the software was not going to be ready much more than
a year before the required certification date, the analysis commenced on
’pre-production’ versions of the software.

Obviously, there are re-analysis cost implications of analysing early software
versions and the analysis team has been growing to optimise the costs of the
analysis and at the same time meet the required timescales. In particular the
growth of the analysis team has been rapid since the start of 1992, when TA
Consultancy Services had a team of 15 people working on the task, to the present
time in May 1993 when the team has grown to in excess of 80 people working full
time (including managers, analysts and dedicated support staff).

Considering the results, in terms of numbers of anomalies raised, the latest
figures available relate to April 1993 at which time analysis had been completed
and comments sentenced for 55% of the production version software. Just under
2000 comments had been raised from the analysis of this software, a rate of around
one comment for every 30 lines of code. This rate compares well with other
’traditionally’ developed and verified safety critical software on which TA
Consultancy Services have conducted similar retrospective MALPAS analyses.
After sentencing the majority of the comments (52%) were classified as category
4 (no action comment), 40% were category 2 (specification change) and 8% were
category 3 (non-critical code change). There were no category 1 comments.
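
As a rough consistency check on the figures quoted above, the short Python calculation below reproduces the comment rate and approximate category counts. It assumes that the 55% figure applies to the approximately 100,000 lines of unique executable code quoted in Section 4 and rounds "just under 2000" up to 2000; neither assumption is stated in the text, so the outputs are indicative only.

```python
total_lines = 100_000      # approximate unique executable lines (Section 4)
percent_analysed = 55      # share of production software analysed and sentenced
comments = 2000            # "just under 2000", so treated as an upper bound

# Lines analysed so far, under the assumption stated in the lead-in.
lines_analysed = total_lines * percent_analysed // 100
lines_per_comment = lines_analysed / comments
print(lines_per_comment)   # 27.5, i.e. around one comment per 30 lines

# Approximate number of comments in each final category.
percent_by_category = {"Cat 1": 0, "Cat 2": 40, "Cat 3": 8, "Cat 4": 52}
counts = {cat: comments * pct // 100 for cat, pct in percent_by_category.items()}
print(counts)              # {'Cat 1': 0, 'Cat 2': 800, 'Cat 3': 160, 'Cat 4': 1040}
```

Because the true number of comments is slightly below 2000, the actual rate is slightly above 27.5 lines per comment, which is consistent with the quoted figure of around 30.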

The no action comments cover a wide range of raised anomalies, some of which
are suggestions for code or design improvements, for example to make code more
defensive, but which are considered not to justify code changes. Others may relate
to trivial specification deficiencies or points for clarification. Similarly the category
3 comments cover a wide range of instances where it has been considered necessary
to make changes. Some have been because of straightforward deficiencies such as
type mismatches at procedure calls and others relate to improvements in code
defensiveness.

13.  Conclusions 

The MALPAS Analysis work conducted on the Sizewell ’B’ Primary Protection
System Software has shown the feasibility of conducting a rigorous retrospective
analysis on a large software system that has not been developed with this form of
analysis in mind. The tools, methods and techniques used for the work have been
shown to be very suitable but the costs of their use in such a retrospective manner
are high.

At a more detailed level the project has demonstrated the benefits of both
Compliance Analysis and of in-depth reviews of all work. During Compliance
Analysis, it has been found that the twin activities of deriving the mathematical
specifications and directing the tool to show conformance result in both the code
and specifications being scrutinised in the minutest detail, thereby leading to the
identification of subtle but potentially significant problems.

The benefits of the reviews in terms of the consistency and checking that they
provide are considered to far outweigh their not insignificant cost (in terms of time
taken for the reviews and any necessary rework). Similarly with the Compliance
Analysis, although the costs of Compliance Analysis are high in absolute terms they
are low in relation to the costs of any potential software malfunction.
Consequently, the conduct of rigorous static analysis, including Compliance
Analysis, is considered to be essential for software within systems of the criticality
and potential failure consequences of the Sizewell ’B’ PPS.

Whilst not claiming that the analysis provides a formal proof of the safety of the
software, the analysis does provide a formal verification that the source code meets
its low level and intermediate level specifications. The analysis has increased the
integrity of the software through code and documentation modifications that have
been made as a result of anomalies raised during the analysis. Furthermore,
through its rigour, the analysis has greatly increased confidence in the correctness
of the software and it is hoped that this, taken along with the four Independent
Assessment activities will greatly contribute towards a successful certification of the
software by the UK NII.


Abstract

This paper describes a safety critical computer system used for
automatic train control. It has been developed during the last
three years and is currently in the phase of final testing and
validation. After a short system overview, the paper will
concentrate on safety aspects in system design and on the
description of the verification and validation process that was
chosen. This specifically includes the problems and aspects of
the selection of applicable norms, the definition of a validation
and verification plan and the upper levels of verification.

Keywords

automatic train control, verification & validation plan, fault tree


1 Introduction

In the past decade suburban public transportation has experienced an enormous rise
in popularity. In Switzerland, many existing railway companies are confronted
with the need for increased train capacity and train frequency.

As many of these networks consist of mostly single track lines and are built in
densely built areas, increasing the capacity of the existing lines is often the only
way of meeting these demands.

Railway safety has reached a very high standard. One of the biggest remaining
safety problems today is the supervision of train drivers. The above mentioned
increased train speeds and traffic density of today's railroading lead to an
increased pressure on the train drivers and therefore to an increased chance of
human error. Also due to the mentioned factors the probability and consequences
of accidents in case of human errors have drastically increased.

Based on these circumstances four railway companies under co-ordination of the
Swiss Federal Bureau of Transportation defined a specification for a new,
advanced automatic train control system.


2  System  Overview 

2.1  General  System  Requirements 

A study of incidents in the past, and of the present safety situation led to a system
requirement specification. It is helpful for the general understanding of the
proposed solution to summarise the major requirements.

• Train speed has to be supervised continuously, based on signal aspects, track
conditions and permitted train speed. This supervision has to include the brake
curves in order to maintain a speed permitted by these sometimes overlapping
constraints at any time.

• A speed of 10 km/h in excess of the permitted value has to be prevented at any
time.

• Trains must be prevented from unauthorised departures in both directions as
well as from overrunning of signals at danger (e.g. stop indication).

• Signalled stops must be adhered to within a limit of 10 m.

• Disturbance of train operation has to be kept to a minimum. Apart from
entering train data prior to the first departure train crews must not manipulate
the system during train operations.

• A change to a less restrictive signal aspect has to be recognised immediately
by the system. The capacity of existing lines has to be utilised fully. This
includes the speed profile given by track geometry.

• The proposed system has to be fail safe, e.g. a safety certification similar to
that for solid state interlocking systems is required.


As none of the currently produced ATC-systems fulfilled the given specification
nor had the capacity of being modified up to an appropriate level it was decided
to develop a new system. This also allowed us to take advantage of the enormous
advances in technology during the last few years.


2.2 System Functionality

The following paragraphs will give a short description of the functionality of the
ATC-system.

The proposed new ATC-system is used to supervise train movements according to
signal aspects and track conditions and to prevent any unauthorised exceeding of
the given speed limits at any time.

It is based on a data base which contains a route map that describes the network of
a supervised railway. This data base contains the track geometry ( distances,
gradient, permitted track speed, points, track arrangement etc. ), the location of
signals, level crossings etc. and general data like the braking characteristics of
vehicles etc.

This data base is stored in a data module on board of each vehicle. This module is
easily changeable in case of changes in the network data.

Signal indications and point positions of a certain area, usually a station, are
transmitted cyclically to any vehicle within this area. Individual addressing of
specific vehicles as well as communication from train to track is not necessary. A
pilot line was chosen as a transmission system but data radio or similar technology
could also be used.

Computers on board of the vehicles are permanently updating their position within
the network data with the aid of wheel rotation encoders. Wheel slip gets
suppressed by an algorithm similar to the ones used in wheel slip monitoring
systems. The position measurement is readjusted at certain locations with the aid
of passive synchronisation points. Their exact location is recorded in the data base.
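
The position update described above can be illustrated with a minimal Python
sketch. It is not the implementation used in the ATC-system: the class name, the
parameter names and the synchronisation point identifier are hypothetical, and
the wheel slip suppression algorithm, which the text does not specify, is left out.

```python
class PositionTracker:
    """Tracks the distance travelled along the route from encoder pulses.

    All names are illustrative assumptions; wheel slip handling is omitted.
    """

    def __init__(self, wheel_circumference_m, pulses_per_revolution, sync_points):
        # Distance covered per encoder pulse.
        self.metres_per_pulse = wheel_circumference_m / pulses_per_revolution
        # Mapping: synchronisation point id -> exact location from the data base.
        self.sync_points = dict(sync_points)
        self.position_m = 0.0

    def on_pulses(self, pulse_count, direction=1):
        """Advance (direction=1) or reverse (direction=-1) the position."""
        self.position_m += direction * pulse_count * self.metres_per_pulse
        return self.position_m

    def on_sync_point(self, point_id):
        """Discard accumulated drift by adopting the recorded location."""
        self.position_m = self.sync_points[point_id]
        return self.position_m
```

Resetting the position at each synchronisation point keeps the accumulated
encoder error bounded by the distance between two such points.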

The on-board computer of each vehicle permanently evaluates its route based on
the current position of the vehicle, the direction of movement, the data base and
the received point positions. It searches the route for speed restricting elements like
signals, track speeds, speeds over points etc. and calculates the maximum
permitted speed at any moment ( Figure 1 ). If the actual speed reaches the
permitted value, a warning is issued to the driver requiring him to reduce the train
speed. When the speed exceeds the permitted speed by a defined value (e.g. 3
km/h) the computer actuates the regenerative/rheostatic brake. As soon as the
speed is below the limit of the speed curve the braking is discontinued. In the event
of a speed excess greater than what is considered tolerable (e.g. 10 km/h) the
computer actuates an emergency brake application. The latter should only occur if
a driver doesn't react properly to an issued warning and if the vehicle doesn't react
to the application of the regenerative/rheostatic brake, whether due to insufficient
adhesion or, with older vehicles, because they are not equipped with a
regenerative/rheostatic brake.
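
The supervision logic described above can be sketched in Python as follows. This
is an illustration under stated assumptions, not the system's actual algorithm:
the function names, the constant deceleration of 0.8 m/s^2 and the simple brake
curve v = sqrt(v_target^2 + 2ad) are hypothetical. Only the example margins of
3 km/h and 10 km/h come from the text. The sketch is stateless, so it does not
model the rule that service braking continues until the speed is below the curve.

```python
import math
from enum import Enum


class Action(Enum):
    NONE = "none"
    WARNING = "warning to driver"
    SERVICE_BRAKE = "regenerative/rheostatic brake"
    EMERGENCY_BRAKE = "emergency brake"


def permitted_speed_kmh(restrictions, deceleration=0.8):
    """Lowest speed (km/h) allowed by all overlapping restrictions ahead.

    restrictions: list of (distance_m, target_speed_kmh) pairs.
    deceleration: assumed constant braking rate in m/s^2 (illustrative).
    """
    limits = []
    for distance_m, target_kmh in restrictions:
        target_mps = target_kmh / 3.6
        # Brake curve: the speed from which the target can still be reached.
        v_mps = math.sqrt(target_mps ** 2 + 2.0 * deceleration * max(distance_m, 0.0))
        limits.append(v_mps * 3.6)
    return min(limits)


def supervise(actual_kmh, permitted_kmh, service_margin=3.0, emergency_margin=10.0):
    """Map the current speed excess (km/h) to the intervention in the text."""
    excess = actual_kmh - permitted_kmh
    if excess > emergency_margin:
        return Action.EMERGENCY_BRAKE
    if excess > service_margin:
        return Action.SERVICE_BRAKE
    if excess >= 0:
        return Action.WARNING
    return Action.NONE
```

Taking the minimum over all restrictions is what produces the overlapping
envelope of speed curves: the most restrictive element on the route always wins.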

Temporary restrictions ( e.g. construction sites etc. ) are programmed into the
stationary computers and transmitted to the vehicles together with the signal
indications and point positions.


Figure 1: Examples of monitored speed curves with overlapping restrictions

The actual system structure is shown in Figure 2. The stationary computer in the
station interlocking as well as the computer on board of each vehicle are designed
with a 2 out of 3 structure.

Each sub-computer of the 2 out of 3 systems contains the full system functionality
including all the necessary hardware and software. While performing the
functionality of the system the three sub-computers interchange input data and
final results of their calculations. In case of differences between the results of the
three, the one with the differing data gets switched off by the two others. The
remaining sub-computers continue as a 2 out of 2 system. In case of an additional
failure, the system shuts down entirely. In the case of the stationary computer this
means that no telegrams are transmitted. In the case of the on-board computer this
leads to an application of the emergency brake.
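
The degradation from 2 out of 3 to 2 out of 2 and finally to shutdown can be
sketched as follows. This is a simplified Python model, not the system's
implementation: the class, the sub-computer names "A", "B", "C" and the
comparison of whole results by equality are assumptions made for illustration.

```python
from collections import Counter


class TwoOutOfThree:
    """Simplified voter: 2 out of 3, then 2 out of 2, then shutdown."""

    def __init__(self):
        self.active = {"A", "B", "C"}  # hypothetical sub-computer names
        self.shut_down = False

    def vote(self, results):
        """results: dict mapping sub-computer name -> (hashable) result.

        Returns the agreed result, or None once the system has shut down.
        """
        if self.shut_down:
            return None
        live = {name: results[name] for name in sorted(self.active)}
        value, count = Counter(live.values()).most_common(1)[0]
        if count == len(live):
            return value  # all active sub-computers agree
        if len(live) == 3 and count == 2:
            # The two agreeing sub-computers switch off the differing one.
            dissenter = next(n for n, v in live.items() if v != value)
            self.active.remove(dissenter)
            return value
        # Disagreement without a majority: shut down entirely (safe state).
        self.shut_down = True
        return None
```

A `None` return stands for the safe reaction named in the text: no telegrams are
transmitted by the stationary computer, or the emergency brake is applied on board.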

There is of course a large number of additional functions like emulation of the
existing ATP-system, special modes for depots or industrial spurs, shunting mode,
overriding of signals at danger in case of an interlocking failure, exception
handling etc. but their description is neither needed for the general understanding
of the system design nor for the description of the verification and validation
process.




Figure 2: General System Architecture


3  Safety  Aspects  of  the  System  Design 

The described ATC-system includes cab signalling of the permitted speed. At a
later stage of the installation and usage of the system it is intended to remove the
existing conventional signalling. This leads to the requirement for a fail safe
system design, as the correct observation of lineside signals by the driver can no
longer be assumed. Together with the required functionality, reliability and
redundancy, the following safety concept was chosen:

•  A 2 out of 3 system structure is used to detect hardware failures and to give
the required level of redundancy. A data exchange between the sub-computers
allows evaluation of the correct functioning of the hardware including all the
data input circuitry. This method is supported by additional self tests. Output
circuitry for safety relevant outputs is designed based on conventional rules for
fail safe hardware ( relays etc. ). Otherwise, the use of hardware components
designed after the classical rules for fail safe hardware is avoided wherever
possible. This allows the usage of industry standard hardware in most
components and eases later hardware upgrades.

•  Data transmission between track and train is designed taking advantage of the
2 out of 3 system structure. Each of the three sub-computers in the interlocking
generates data telegrams including data encoding. These telegrams are
transmitted by each of the three sub-computers in a cyclic manner. The
on-board computers require the reception of at least two identical telegrams from
two different sources to accept the received data. This procedure tolerates the
transmission of incorrect telegrams by one sub-computer. It is also immune
against the falsification of single telegrams into different but correctly
encoded telegrams.

•  The necessary correctness of the system software is ensured by the strict
application of the chosen norms and by rigid testing. This includes modern
testing methods and procedures like for example static software analysis and
path coverage analyses.

•  Data entry of safety relevant data like train characteristics is secured by the
chosen data entry procedure. This procedure includes the echoing of the
altered data after validation by the three sub-computers. This echo gets
displayed via an independent hardware channel for verification and
confirmation by the train driver. This procedure additionally allows the
detection of data distortion or hardware failures in the data entry channel.

•  The correctness of the route map is ensured with several independent
procedures. First, data collection is performed with the aid of a computer
based data collection software. This software ensures systematic working
procedures by forcing the user to generate and to update the data base
following well defined paths. Second, the data collection software performs a
whole set of data verification procedures. This includes boundary checking of
entered data elements and data consistency testing after data entry. The
software automatically generates indices for new data releases to ensure the
exclusive use of valid data. Once a new version of the data base has been
generated a back transformation software regenerates a route map for an
additional visual data verification.
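
The telegram acceptance rule from the safety concept above (at least two
identical telegrams from two different sources) can be sketched as follows. The
function name and the representation of a telegram as a `(source_id, payload)`
pair are assumptions for illustration. Decoding and checking of the data
encoding are not shown.

```python
from collections import defaultdict


def accept_telegram(received):
    """Return the payload confirmed by two different sources, else None.

    received: iterable of (source_id, payload) pairs collected in one cycle,
    where payload is an already decoded, hashable telegram content.
    """
    sources_by_payload = defaultdict(set)
    for source_id, payload in received:
        sources_by_payload[payload].add(source_id)

    confirmed = [
        payload
        for payload, sources in sources_by_payload.items()
        if len(sources) >= 2
    ]
    # Accept only an unambiguous result; anything else is rejected.
    return confirmed[0] if len(confirmed) == 1 else None
```

Counting distinct sources rather than telegrams is what makes the rule robust:
a single sub-computer repeating a wrong telegram can never confirm itself.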

4  Norms  and  Standards  used  for  System  Verification 
and  Validation 

Any new development has to consider the requirements set by the opening of the
European Market to international competition. This includes of course norms and
standards required for verification and validation. One of the major problems in
this field is the current lack of harmonisation of standards.

In several European countries, national railway companies and industry have in the
past developed local norms, standards or rulebooks. Some of these are based on
long term experience with conventional relay based technology or with electronic
hardware.


Some norms for electronic systems including microprocessors and software are
currently under development, but they follow different approaches and
philosophies. This depends on the country from which they originate and on the
understanding and the definition of the term safety on which they are based.

A study was carried out to compare the currently used norms in Great Britain,
France and Germany with the ones used in Switzerland. The study also had a look
at the proposals in work in the different committees of the CEN/CENELEC. This
comparison showed that bigger differences between the philosophies of the
proposed standards still have to be overcome. As a result of the study, the
following norms were chosen as a base of the verification and validation plan:

•  the British norm RIA 23 [1], which was based on the Draft IEC SC65A WG9,
for the implementation of the safety analyses and the risk assessment,

•  the proposed DIN norm DIN V 19250 [2] for the evaluation of the required
safety level

•  and the proposed DIN norm DIN V VDE 0801 [3] for the definition of the
further implementation.

In the meantime the work groups WGA1 and WGA2 of the CENELEC committee
SC9XA, which is responsible for the proposal of a new norm, chose two IEC
papers [4], [5] as a basis of their future standardisation work. These papers
refer to the chosen DIN norms.


5 The Verification and Validation Plan

Based on the chosen norms a verification and validation plan was defined. This
was done in close co-operation with the Swiss Federal Bureau of Transportation,
which will be responsible for the final system approval. The verification and
validation plan covers the following topics:


•  project  management 

•  development 

•  commissioning

•  maintenance

•  documentation

•  configuration management.

For each of the above mentioned topics individual chapters cover the following
topics:

•  Norms and standards to be applied for each of the defined steps.

•  Definition of a detailed plan for the verification and validation of each step to
reach system approval.

•  Definition of the persons or offices responsible for the implementation,
verification and validation of each of the defined steps.

•  Definition of the responsibilities of each of the participating persons or offices
to perform the above mentioned steps.

Figure 3 shows as an example the proposed plan for the system development:

Figure 3: Verification and Validation Plan


Each step ( work flow ) of the development as shown in the plan above is verified
with the defined rules and according to the chosen norms of the verification and
validation plan and documented in an individual paper. Each of these papers is
presented individually to the supervisory authority for approval. This procedure
allows an early detection of disagreement about the contents of the safety
certification and also fast certification after final testing, even in the case of such a
fairly complex ATC-system.

6 Examples of Verification Procedures

This chapter gives a more detailed look into two of the large number of different
areas of system verification. These examples were chosen to show the spectrum of
measures necessary to fully implement a verification and validation plan.


6.1 Example 1: Verification of the System Specification

The first example shows the safety analyses that were carried out to verify the
system specification. It is based on a fault tree analysis of the current operations at
the participating railway companies. The fault trees were verified with the
available accident statistics and with a review with railway experts from different
branches.

Figure 4 shows as an example a fraction of one of the fault trees.

Figure 4: Example of Fault Tree Analyses


The different possible accidents deduced from this fault tree analysis were
weighted, based on the probability of occurrence of the different causes leading to
these accidents and the scale of the damage that could likely result. This
procedure is based on the model of RIA 23 [1], which again is based on the Draft
IEC SC65A WG9 [4]. It led to tables like the one shown in Figure 5.

V7 = 202 ∧ 209 ∧ 212    V10 = 202 ∧ 210 ∧ 213
V8 = 202 ∧ 209 ∧ 213    V11 = 202 ∧ 211 ∧ 212
V9 = 202 ∧ 210 ∧ 212    V12 = 202 ∧ 211 ∧ 213

Figure 5: Example of Failure Classification


The norm DIN V VDE 0801 [3] contains more detailed rules for the
implementation of the verification and validation plan compared to the RIA 23 [1]
and was therefore chosen as a base for further steps. To allow this, a transformation
of the required safety level via DIN V 19250 [2] was undertaken. Some
assumptions based on this norm had to be made. As the possible severity of
accidents with trains in suburban traffic is limited, catastrophic events were defined
to correspond to S3. The probability remote was assigned to W2. These assumptions
led, with the definition of a frequent to permanent collective stay within the
system's sphere, to a safety requirement level 6.

With assumptions on the possible effect of an additional safety system on the
probability and the resulting damage of accidents, the system specification was
verified. This verification also led to the definition of the required integrity level
for the system according to the chosen norm [2].


6.2 Example 2: Software Implementation

The second example gives an overview of the procedures defined for the
implementation and the testing of the software.

As already mentioned each step ( work flow ) of the development as shown in the
corresponding verification and validation plan is performed and verified according
to a reviewed and approved guideline. The guideline for software implementation
covers the following topics:


•  Definition of the chosen programming language ( in this case Modula 2 ). This
includes rules, restrictions and recommendations for the use of language
elements which could lead to difficulties in the verification of the code, like
pointers, variant records etc.


•  Definition of the structure and the look of source files to allow several
programmers to share code without the risk of misunderstanding or
misinterpretations due to personal style.

•  Definitions for the naming and the marking of language elements like
variables, constants, subroutines etc. for the same reasons.


•  Definitions for the required documentation of program code including style,
depth, volume etc.

•  Definition of the procedures and tools used to verify the correctness of the
code including check lists. The tools used include for example: syntax check
of the source code including the above mentioned rules and restrictions, static
code analyses including software metrics and path coverage analyses including
the definition of the used tools and the registration of test cases and test
results.


7  Conclusions  and  Acknowledgements 

The experience gained in the development of the described ATC-system in the last
three years is summarised in the following list:

•  As long as there is no harmonisation and standardisation for the verification
and validation of computer based safety systems in railway applications in
Europe, a selection of the norms currently under development has to be made
individually for new projects. Once a harmonisation has been reached the
transition time for the application of the new norms will be very short
compared to the development cycle of a larger project. It is therefore advisable
to use, already today, norms that will probably closely conform to the new
norms. Existing norms are only applicable in some countries and usually
represent a country specific approach to safety. They are often not applicable
in other countries.

•  A verification and validation plan has to be defined, approved and followed
from the very beginning of a new development to avoid later problems, delays
and cost for system approval.

•  The experience shows that with the application of the above mentioned plan a
development can be carried out in a shorter time due to a permanent
co-ordination between developer, customer and supervisory authority.

•  The experience also shows that the application of the above mentioned plan
allows system validation at a significantly lower cost due to an early
involvement of the supervisory authority. Wishes and requirements for
changes can be respected at the appropriate level of the development cycle.

ABSTRACT

Randomly generated software tests are an established method of estimating
software reliability [5, 7]. But as software applications require higher and higher
reliabilities, practical difficulties with random testing have become increasingly
problematic. These practical problems are particularly acute in life-critical
applications, where requirements of 10^-7 failures per hour of system reliability
translate into a probability of failure (pof) of perhaps 10^-9 or less for each
individual execution of the software [4]. We refer to software with reliability
requirements of this magnitude as ultra-reliable software.

This  paper  presents  a  method  for  assessing  the  confidence  that  the  software 
does  not  contain  any  faults  given  that  software  testing  and  software  testability 
analysis  have  been  performed.  In  this  method,  it  is  assumed  that  software  testing 
of  the  current  version  has  not  resulted  in  any  failures,  and  that  software  testing 
has not been exhaustive. In previous publications, we have termed this method
of combining testability and testing to assess a confidence in correctness as the
“Squeeze Play” and “Reliability Amplification” [15, 13]; however, we have not
formally developed the mathematical foundation for quantifying a confidence that
the software is correct. We do so in this paper.


1  Introduction 

The  probability  of  failure  of  a  program  is  conditioned  on  an  input  distribution. 
(Another term for an input distribution is “operational profile” [14].) An input
distribution is a probability density function that describes for each legal input
the probability that the input will occur during the use of the software. Given
an input distribution, the probability of failure (pof) is the probability that a
random  input  drawn  from  that  distribution  will  cause  the  program  to  output  an 
incorrect  response  to  that  input. 

Software  reliability  is  defined  as  the  probability  of  failure-free  operation  of 
the software in a fixed environment for a fixed period of time. Note the
difference between the definition for reliability and that for software probability of
failure;  probability  of  failure  is  time  independent.  However,  both  reliability  and 
probability  of  failure  are  tied  to  a  specific  environment. 

Even if ultra-reliable software can be in theory achieved, we cannot
comfortably depend on this achievement unless we can assess its reliability in a
convincing, systematic, and scientific manner. As pointed out in [2], black-box
testing is impractical for establishing these very high reliabilities. In general,
by executing T random tests, we can estimate a probability of failure in the
neighborhood of 1/T when none of the tests reveal a failure [3]. If the required
reliability is in the ultra-reliable range, random testing would require decades of
testing before it could establish a reasonable confidence in this reliability, even
with the most sophisticated hardware. Based on these impracticalities, some
researchers contend that very high reliabilities cannot be quantified using
statistical techniques [1]. In dismissing all possible statistical techniques because
of the practical problems with random testing, we think Butler and Finelli are
being premature, and this paper describes how a statistical technique in
addition to random testing may be brought to bear on the problem of assessing
ultra-reliability.

Two purposes of software testing are establishing a reliability estimate and
finding software faults. When software does not fail during non-exhaustive
testing, there is good news and bad news. The good news is that we suspect that the
software no longer has gross faults. The bad news is that testing is no longer as
effective at estimating reliability or at uncovering the remaining faults (if they
exist).

It  is  disheartening  to  realize  that  it  is  more  difficult  to  assess  the  reliability 
of  a  program  that  has  not  failed  than  it  is  to  assess  the  reliability  of  a  program 
undergoing  some  proportion  of  failures.  Previous  work  has  tackled  the  problem 
of  assessing  the  probability  of  failure  of  software  that  has  not  failed  [3].  In  this 
paper,  we  consider  a  closely  related  problem  using  an  approach  that  is  distinct 
from  traditional  testing.  We  are  interested  in  the  confidence  that  the  software 
is  correct  given  that: 

1.  the  program  has  not  failed  in  T  tests,  and 

2. we have a prediction of the minimum non-zero failure probability (from
testability  analysis  that  has  been  performed  on  the  program). 

We  agree  that  testing,  on  its  own,  cannot  be  used  to  establish  ultra-reliabilities. 
However, testing is not the only statistical technique possible for analyzing
software reliability. We believe that software testability, as a quantifiable measure
of  software  quality,  can  be  used  in  conjunction  with  testing  to  assess  reliability. 

2  Software  Testability  Analysis 

We now discuss “sensitivity analysis,” a statistical technique complementary to
testing. Sensitivity analysis is one algorithm for performing testability analysis.
Used in conjunction with testing, sensitivity analysis may allow us to estimate
reliability to a much higher precision than was possible with testing alone. A
preliminary model for doing this reliability assessment was previously presented
in [13].

Sensitivity analysis uses program mutation, data state mutation, and
repeated executions to predict a minimum fault size [6]. The minimum fault size
is the smallest probability of failure likely to be induced by a programming
error (with respect to the program, testing distribution, and simulated faults that
are injected). Sensitivity analysis does not use an oracle, and can therefore be
completely automated as implemented by the PiSCES tool [18].

Testing  establishes  an  upper  limit  on  the  software’s  probability  of  failure; 
sensitivity  analysis  establishes  a  lower  limit  on  the  probability  of  failure  that 
is  likely  to  occur.  Together,  the  estimate  and  the  prediction  can  be  used  to 
establish  confidence  that  software  does  not  contain  any  faults. 

Sensitivity analysis is based on separating software failure into three phases:
execution  of  a  software  fault,  creation  of  an  incorrect  data  state,  and  propagation 
of  this  incorrect  data  state  to  a  discernible  output.  This  three  part  model  of 
software  failure  [8]  will  be  referred  to  as  PIE,  for  Propagation,  Infection,  and 
Execution.  In  this  paper  we  examine  how  to  apply  PIE  to  the  task  of  finding 
a  realistic  minimum  non-zero  probability  of  failure  prediction,  a,  when  random 
testing  has  discovered  no  errors. 

If  a  location  contains  a  fault,  and  if  the  location  is  executed,  the  data  state  of 
the  execution  may  or  may  not  be  changed  adversely  by  the  fault.  If  the  fault  does 
change  the  data  state  into  a  data  state  that  is  incorrect  for  this  input,  we  say  the 
data  state  is  infected.  To  predict  the  probability  of  infection,  the  second  phase 
of  sensitivity  analysis  performs  a  series  of  syntactic  mutations  on  each  location 
[9].  After  each  mutation,  the  program  is  re-executed  with  random  inputs;  each 
time  the  monitored  location  is  executed,  the  data  state  is  immediately  compared 
with  the  data  state  of  the  original  (unmutated)  program  at  that  same  point  in 
the  execution.  If  the  state  differs,  infection  has  taken  place  [12]. 

The  third  phase  of  the  analysis  estimates  propagation.  Again  the  location  in 
question  is  monitored  during  random  tests.  After  the  location  is  executed,  the 
resulting  data  state  is  changed  by  assigning  a  random  value  to  one  data  item 
using a predetermined distribution. (Research is ongoing as to the best
distribution to use for this random selection. Current experiments use an equally likely
distribution  over  the  range  of  values  for  this  variable  during  random  testing.) 
After  the  data  state  is  changed,  the  program  continues  executing  until  an  output 
results.  The  output  that  results  from  the  changed  data  state  is  compared  to  the 
output  that  would  have  resulted  without  the  change.  If  the  outputs  differ,  error 
propagation  has  occurred. 

Each phase produces a probability estimate based on the number of events
(either execution, infection, or propagation) divided by the number of trials.
Execution,  infection,  and  propagation  must  all  occur  to  result  in  a  failure  at 
this  location.  Thus  the  product  of  these  estimates  yields  an  estimate  of  the 
probability  of  failure  that  would  result  if  this  location  had  a  fault. 
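
The combination of the three estimates for one location can be sketched in Python
as follows. This is an illustration only: the function names and the
representation of each phase as an `(events, trials)` pair are assumptions, and
each estimate is taken as the fraction of trials in which the event occurred.

```python
def estimate(events, trials):
    """Fraction of trials in which the event (e.g. infection) occurred."""
    if trials <= 0:
        raise ValueError("at least one trial is required")
    if not 0 <= events <= trials:
        raise ValueError("events must lie between 0 and trials")
    return events / trials


def location_failure_estimate(execution, infection, propagation):
    """Estimated probability of failure if this location held a fault.

    Each argument is an (events, trials) pair for one phase. All three
    phases must occur for a failure, so the estimates are multiplied.
    """
    return estimate(*execution) * estimate(*infection) * estimate(*propagation)
```

For example, a location executed in 50 of 100 runs, infected by 20 of 100
mutants and propagating 10 of 100 perturbed states yields 0.5 × 0.2 × 0.1 = 0.01.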

Sensitivity analysis is a new empirical technique. Since sensitivity analysis
does not require an oracle, it can be completely automated. Preliminary results
of the accuracy of the lower bounds produced by sensitivity analysis have been
encouraging [9, 11, 10].

Sensitivity analysis produces 3 probability estimates for each location
examined. In order to assess testability for a program or module, we must be able
to give one probability prediction: the “latent failure rate.” The latent failure
rate is a prediction of the minimum non-zero failure probability that can occur
from any possible fault in the program given the fault classes that were
simulated by sensitivity analysis. It is possible that an actual fault will induce a
lower  failure  probability  than  the  latent  failure  rate.  This  occurs  when  the  fault 
classes  simulated  by  SA  do  not  include  a  particular  fault,  and  that  fault  has  a 
lower  failure  rate  associated  with  it  than  any  of  the  faults  that  are  simulated. 
This  is  an  unavoidable  possibility  that  is  incurred  by  all  fault-based  techniques; 
fault-based  techniques  are  only  as  powerful  as  the  range  of  the  fault  classes  that 
they  simulate.  However,  in  the  experiments  cited  above,  such  “surprisingly  small 
faults”  were  rare. 


3  Hoeffding’s  Inequality 

When we test a program randomly, success is defined by whether or not it
performs correctly on all tests. However, since test cases are chosen randomly, there
is a certain probability of drawing a test sequence that fools us into believing
that a “bad” program is “good.” An extreme example is the case in which the
same test case is drawn N times, an unlikely but possible outcome of N draws
with replacement. Although we can honestly say in this case that we have
performed N random tests, in truth our testing has told us little about the quality
of the program.

Our analysis assumes that sampling is with replacement. However, it is normally
unlikely that we will draw the same test N times, and we might hope that there
is only a small probability of drawing any sort of grossly unrepresentative set
of tests. In fact, we can bound this probability as follows. We first note that
if the probability of a program failure is $\epsilon$, then the probability
that it will fail in exactly k out of N tests is given by a binomial
distribution:

$$\Pr(k \text{ failures in } N \text{ tests}) = \binom{N}{k}\,\epsilon^{k}\,(1-\epsilon)^{N-k}.$$

Therefore, the probability that the program will not fail on any test is
obtained by setting k = 0; the binomial expression then becomes
$(1-\epsilon)^{N}$, the confidence bound that we have already used. To state
this result in mathematical terms, let $\phi$ be an empirical estimate for the
probability of an event (in this case the event is a program failure), which is
constructed by dividing the number of times the event occurred by the number of
tests that were made. Also let $\nu$ be the true probability of the event. Then
we have just shown that

$$\Pr(|\phi - \nu| \ge \epsilon) \le (1-\epsilon)^{N}$$

given that $\phi$ is 0. In other words, we have shown that the difference
between the true probability of failure and the estimated probability of
failure is greater than or equal to $\epsilon$ only with probability
$(1-\epsilon)^{N}$.
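
The binomial argument above can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names and the sample values (failure probability 0.01, 300 tests) are assumptions chosen only for illustration.

```python
from math import comb


def prob_k_failures(eps: float, n: int, k: int) -> float:
    """Binomial probability of exactly k failures in n independent random tests,
    when each test fails with probability eps."""
    return comb(n, k) * eps**k * (1.0 - eps) ** (n - k)


def prob_no_failures(eps: float, n: int) -> float:
    """Special case k = 0: all n tests pass, which happens with (1 - eps)^n."""
    return (1.0 - eps) ** n


if __name__ == "__main__":
    # Hypothetical values: a program that fails on 1% of inputs, tested 300 times.
    print(prob_no_failures(0.01, 300))
```

Setting k = 0 in `prob_k_failures` gives the same value as `prob_no_failures`, which is the confidence bound used in the text.
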

In testability analysis we are faced with a problem that is similar to the one
we face in testing. We obtain an empirical estimate of the probability of some
event (in this case the event is no longer a program failure but a propagation,
an infection, or an execution) and we wish to know how likely it is that we
have drawn an unrepresentative sequence of tests, and thus obtained an estimate
that is actually far off the mark. We require an upper bound on

$$\Pr(|\phi - \nu| \ge \epsilon)$$

that applies in the general case, and not just when $\phi = 0$. One such bound
was given in [17] and is commonly known as Hoeffding’s inequality. It states
that

$$\Pr(|\phi - \nu| \ge \epsilon) \le 2e^{-2\epsilon^{2}N}.$$

(Note that this is only an upper bound. Even though
$\Pr(|\phi - \nu| \ge \epsilon)$ is a probability and cannot be greater than 1,
the right side of the inequality can be as large as 2. If
$2e^{-2\epsilon^{2}N}$ were, say, 1.6, then the inequality would not tell us
anything new, because we already knew that $\Pr(|\phi - \nu| \ge \epsilon)$ is
less than or equal to 1.0, and hence less than 1.6. In that case the inequality
would not give us any confidence that $|\phi - \nu|$ was less than $\epsilon$,
because, as far as we could ascertain from it, the probability that
$|\phi - \nu| \ge \epsilon$ could be as large as 1.0.)
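
To make the remark about uninformative bounds concrete, here is a small sketch that evaluates the right-hand side of Hoeffding's inequality. The function names are hypothetical, and the sample tolerances and trial counts are illustrative assumptions rather than values from the experiments.

```python
from math import exp


def hoeffding_bound(eps: float, n: int) -> float:
    """Right-hand side of Hoeffding's inequality: 2 * exp(-2 * eps^2 * n)."""
    return 2.0 * exp(-2.0 * eps * eps * n)


def is_informative(eps: float, n: int) -> bool:
    """The bound tells us something new only when it is below 1,
    because any probability is already known to be at most 1."""
    return hoeffding_bound(eps, n) < 1.0


if __name__ == "__main__":
    for eps, n in [(0.01, 100), (0.1, 100), (0.1, 1000)]:
        print(eps, n, hoeffding_bound(eps, n), is_informative(eps, n))
```

With no trials at all the bound equals 2, and it only drops below 1 once the product of the squared tolerance and the number of trials is large enough.
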

For the purposes of quantifying our confidence in a latent failure rate, we
apply Hoeffding’s inequality and get that

$$\Pr\nolimits_{XD}\big[\,|\phi - \nu| \ge \epsilon\,\big] \le 2e^{-2\epsilon^{2}N} \qquad (1)$$

where

1. X is the space of mutants and perturbation functions used, i.e., X is the
space of programs based on P for which D was used during testing.

2. $\nu$ is the true latent failure rate (unknown).

3. $\phi$ is the predicted (empirical) latent failure rate found via
sensitivity analysis.

4. $\epsilon$ is the “fudge factor” that we set for how much we believe we have
miscalculated $\phi$.

For $\epsilon$, our intuition suggests the use of

$$\epsilon = \phi - (10^{-1} \cdot \phi) \qquad (2)$$

Equation 2 provides a fudge factor of one order of magnitude; although
$\epsilon$ is independent of $\phi$, it makes sense that $\epsilon$ is near
$\phi$’s order of magnitude. As $2\epsilon^{2}N$ decreases,
$2e^{-2\epsilon^{2}N}$ increases; this undesirable situation will be fully
explained later. For now, it is enough to know that $\epsilon$ is the dominant
argument in $2e^{-2\epsilon^{2}N}$, and for a small $\epsilon$, N must be
enormous for $2e^{-2\epsilon^{2}N}$ to be small. (Tables 1 and 2 show the
relationship between the parameters of Equation 4.)
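
How quickly the required number of trials grows as the tolerance shrinks can be seen by solving the Hoeffding bound for N. The sketch below is an illustration under stated assumptions: the target level `delta` and the two tolerances are hypothetical, and the function names are not taken from the paper.

```python
from math import ceil, exp, log


def hoeffding_rhs(eps: float, n: int) -> float:
    """2 * exp(-2 * eps^2 * n), the Hoeffding upper bound."""
    return 2.0 * exp(-2.0 * eps * eps * n)


def required_trials(eps: float, delta: float) -> int:
    """Smallest N for which the Hoeffding bound is at most delta,
    obtained by solving 2 * exp(-2 * eps^2 * N) <= delta for N."""
    if eps <= 0.0 or not 0.0 < delta < 2.0:
        raise ValueError("need eps > 0 and 0 < delta < 2")
    return ceil(log(2.0 / delta) / (2.0 * eps * eps))


if __name__ == "__main__":
    # Shrinking the tolerance tenfold multiplies the required N by about 100.
    print(required_trials(0.09, 0.05))
    print(required_trials(0.009, 0.05))
```

Because the tolerance enters the exponent squared, each order of magnitude lost in the tolerance costs roughly two orders of magnitude in trials.
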


4 The “Squeeze Play”

As we have explained in Section 2, sensitivity analysis provides an empirical
prediction of the minimum non-zero failure probability. And from testing we
have an upper bound, $\theta$, on the probability of failure. If the upper and
lower bounds hold, we have bracketed the true probability of failure. Note that
both the prediction and the estimate must be based on the input distribution D.

To understand what this prediction and this estimate tell us, it is often
easier to visualize these probabilities as fault sizes, meaning the upper bound
can be viewed as the largest fault that is likely to still be remaining after
testing T times, and the latent failure rate can be viewed as the smallest
fault that is likely to still be in the code. Again it is important to remember
that we do not know whether any faults exist in the code.

If we have done enough testing to be very confident that the true probability
of failure is less than the smallest fault we expect to see, then we believe
that there are no remaining faults with respect to the fault classes that were
simulated during sensitivity analysis using D. Section 5 presents the
mathematical confidence that we have in this belief. This idea of increased
testing to increase our confidence that the true probability of failure is less
than $\phi$ is termed the “Squeeze Play” [15, 13]. The same effect can be
realized by increasing $\phi$ via either software design-for-testability or
removing those PIE estimates from the latent failure rate that are driving
$\phi$ down. We can confidently ignore PIE probability estimates only when we
are sure that the location that they are associated with is correct. After all,
if we know there is no fault at a location, then we do not care what the
predicted ability of that location is to hide a fault from us, since none is
there. Such certainty for very small portions of code might be possible in
isolated cases using formal methods such as proof of correctness. Since such
proofs are more easily carried out for small code segments, the use of
sensitivity analysis may encourage future applications of formal techniques
often dismissed as impractical today.


5  Quantifying  Absolute  Correctness 

Testing down to the level of $\phi - \epsilon$ as well as performing
testability analysis for $\phi$ are processes that are statistically subject to
error. As we have shown, Hoeffding’s inequality provides a mechanism for
quantifying this error.

Hamlet’s probable correctness model [16] provides a mechanism for quantifying
how likely testing is to give us a confidence that the true probability of
failure is greater than $\theta$:

$$\mathrm{Prob}[\Theta_D > \theta] = (1-\theta)^{T} \qquad (3)$$

where  the  T  successful  tests  are  selected  according  to  distribution  D.  This 
allows  us  to  determine  how  much  confidence  we  place  in  the  results  of  testing 
the  program  T  times  successfully. 

The following equation gives us a lower bound on our confidence that the
software is correct. If the lower bound is less than zero, this means that we
have zero confidence.

$$\mathrm{Confidence}[\Theta_D = 0]_{XD} = 1 - \left[(1-\theta)^{T} + 2e^{-2\epsilon^{2}N}\right] \qquad (4)$$

There are several aspects of Equation 4 that are noteworthy. First, in the case
where $\theta < \phi$, the size of $\theta$ impacts our confidence in
correctness given fixed T and N. Equation 4, then, suggests that just having a
cross-over of $\theta$ and $\phi$ is not enough; we also need a sufficiently
large $\phi$. Second, as $\phi$ decreases, the required N to overcome a small
$\epsilon$ in order for $2e^{-2\epsilon^{2}N}$ to be near zero becomes
virtually intractable. Thus if we are to have any confidence that our software
is correct we will need $2e^{-2\epsilon^{2}N}$ to be near zero and
$(1-\theta)^{T}$ to be near zero.
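
The lower bound on confidence can be sketched in a few lines. This is an illustrative reading of Equation 4 in which a negative bound is clamped to zero, as the text prescribes; the function name and every numeric value are hypothetical.

```python
from math import exp


def correctness_confidence(theta: float, t: int, eps: float, n: int) -> float:
    """Lower bound on the confidence in correctness:
    1 - [(1 - theta)^T + 2 * exp(-2 * eps^2 * N)], clamped at zero."""
    testing_term = (1.0 - theta) ** t                  # error of the testing bound
    analysis_term = 2.0 * exp(-2.0 * eps * eps * n)    # error of the testability estimate
    return max(0.0, 1.0 - (testing_term + analysis_term))


if __name__ == "__main__":
    # Many tests and a generous tolerance: both error terms are near zero.
    print(correctness_confidence(0.001, 10000, 0.09, 1000))
    # Same testing effort, tenfold smaller tolerance: the bound collapses to zero.
    print(correctness_confidence(0.001, 10000, 0.009, 1000))
```

Both error terms must be small at the same time; a large testing effort cannot compensate for a tolerance that is too small for the available number of trials.
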


6  Conclusions 

We  contend  that  the  preliminary  results  of  experiments  in  software  sensitivity  are 
sufficient  to  motivate  research  into  quantifying  sensitivity  analysis.  Although  the 
technique  will  likely  require  revision,  the  ideas  that  motivate  sensitivity  analysis 
dispute  the  contention  that  random  testing  is  the  only  method  of  experimentally 
quantifying  software  reliability.  We  cannot  guarantee  that  this  new  technique 
will  make  it  possible  to  assess  reliability  to  the  precisions  required  for  life-critical 
software.  However,  we  do  think  it  is  premature  to  declare  such  an  assessment 
impossible. In the preceding sections we have argued that if testability
predictions can be quantified accurately, then it is plausible to combine
random testing results with testability results to assess reliability more
precisely than is possible with testing alone.

Both random testing and testability analysis gather information about possible
probability of failure values for a program. However, the two techniques
generate information in distinct ways: random testing treats the program as a
single monolithic black box while sensitivity analysis examines the source code
location by location; random testing requires an oracle to determine
correctness but sensitivity analysis requires no oracle because it does not
judge correctness; testing that reveals no failures focuses on the possibility
of no faults existing while sensitivity analysis focuses on a “what if a fault
exists in this location” analysis. The two techniques provide independent data
about how frequently the program should fail if any faults exist.

This paper has primarily focused on the case where $\theta < \phi$, because
this is the easiest situation in which to gain confidence in the software’s
correctness. And as Tables 1 and 2 have shown, it is preferable that
$\theta < \phi$. Unfortunately, this situation may be infrequent, and more
frequently $\theta > \phi$. This situation makes it almost impossible to
provide a confidence in the absolute correctness of the code. Two main
strategies can push $\theta$ and $\phi$ closer: decreasing $\theta$ by
increasing T, the number of tests; or increasing $\phi$ by rewriting code
locations that have low sensitivity. Software design-for-testability is one
avenue of research that we are exploring that we hope will generally cause
systems to have higher $\phi$.

One interesting side-effect of Equation 4 is that not all cross-over cases are
equivalent in the derived confidence. This is because as $\epsilon$ decreases
(when $\phi$ decreases), the required N to overcome this deficiency becomes
intractable. Thus we have discovered a new argument for attaining a higher
$\phi$: not only does a higher $\phi$ require fewer tests T, but a higher
$\phi$ means a higher $\epsilon$, with which a greater confidence in $\phi$ can
be achieved with a smaller N. This is crucial to the success of Equation 4,
since a tiny N will almost certainly destroy any chance of getting a greater
than zero confidence in the software’s correctness.

Table 1: Various Parameters for Equation 4

Table 2: Additional Parameters for Equation 4


Abstract

An approach to organizing computer support for program testing and
analysis is considered. The approach is based on representing and
using knowledge about a program in the form of a semantic net. The
possibilities and benefits of applying this approach to different kinds
of program analysis, and of using the Prolog language as the tool for
implementing such analysis, are described. The possibility of extending
the approach to the analysis of different program representations and
to other problem areas connected with program engineering is also
indicated.


1  Introduction 

It is known that program testing and analysis are important factors in
improving software reliability and safety. The objects of study and analysis
may be such program features as the interrelations of program objects and their
properties, control flow, data flow and structure, quality characteristics, and
results of execution. The models traditionally used for such analysis are the
control flow graph (c-graph), the data flow graph and the call graph. The
approach offered here models various program features by semantic nets and
frame nets. These knowledge representation formalisms make it possible to
represent uniformly the information about program features that is needed for
different kinds of program analysis. This information is represented by a set
of facts, that is, instances of relations (e.g. relations of calls, nesting,
declaration, usage, location, precedence, characterization, edge coverage) that
exist, or arise during program execution, between entities of the program (e.g.
procedures, variables, constants, c-graph nodes) or between entities and their
characteristics (examples of the latter are the "coordinates" of location in
the program for procedures, the "coordinates" of declaration or usage for
variables and arrays, and quality metric values for procedures and programs).
Facts of these relations simulate a state (features) of the program and form
its model; program analysis actions are described by a set of appropriate rules
and are based on the construction and analysis of such models. The main tool
for implementing this approach is Prolog.

Several papers have influenced the formation of the approach. First of all
these are [1-5], which are based on the selection and use of relations between
program objects; [6, 7], in which some aspects of the use of Prolog for program
analysis were considered; and also [8], in which the possibility of using
graphs and graph grammars for modeling program representations was shown.

2  Representation  of  Knowledge  about  Program 
Features 

In discussing the modeling of program features, it is useful to distinguish
between two levels of knowledge representation: the "user" level, associated
with description, and the "system" level, connected with implementation. On the
former level, which corresponds to the requirements of convenience and
simplicity of knowledge representation, a simple semantic net formalism is
preferable; on the latter level, which takes into consideration the
unambiguity of the representation of program objects in the model, the model
size and other implementation aspects, the frame net formalism is more
adequate [9].

The semantic net formalism is based on the idea of representing knowledge in
the form of a directed graph with named nodes and edges, where nodes correspond
to the objects of the problem area studied, and edges to the relations between
them. There are quite a number of kinds of semantic nets. Let us limit our
consideration to non-homogeneous semantic nets, which contain relations of more
than one kind, and to simple (non-hierarchical) ones, whose nodes have no
structure of their own. From the logical point of view, the main functional
element of a semantic net (a relation and the two nodes connected by it) is
equivalent to a predicate with two arguments. For this reason, a semantic net
may be represented both in a graphic form and in a predicate form.

So, knowledge about some features of an imaginary Pascal program containing
the procedure sample may be represented on the "user" level by the following
facts of semantic net relations (in infix predicate form):

sample  uses-the-variable  index 
index  is-declared-in-line  52 
index  has-type  integer 
index  is-referred-to-in-line  56 
sample  has-length-in-lines  8 
sample  has-commentedness  0 
sample  has-complexity  2 
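
Because each fact is a relation joining two nodes, the net can be held as a list of (subject, relation, object) triples and queried by pattern. The following Python sketch only illustrates that idea; it is not the representation used by the authors, and the `match` helper is a hypothetical name.

```python
# The seven facts listed above, as (subject, relation, object) triples.
FACTS = [
    ("sample", "uses-the-variable", "index"),
    ("index", "is-declared-in-line", 52),
    ("index", "has-type", "integer"),
    ("index", "is-referred-to-in-line", 56),
    ("sample", "has-length-in-lines", 8),
    ("sample", "has-commentedness", 0),
    ("sample", "has-complexity", 2),
]


def match(subject=None, relation=None, obj=None):
    """Return every fact that agrees with the given arguments.
    An argument left as None acts as a wildcard."""
    return [
        (s, r, o)
        for (s, r, o) in FACTS
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]


if __name__ == "__main__":
    print(match(subject="index", relation="has-type"))
    print(match(subject="sample"))
```

Leaving an argument open plays the role of a variable in a query, so `match(subject="sample")` lists everything the net records about the procedure.
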

The frame formalism, which also implies a choice of objects and relations of
the problem area and in a certain sense may be considered a particular case of
semantic nets, is oriented towards the representation of stereotyped
situations. A frame is a composition of situation components (slots), each
having its own name and value, united by a frame name. Since a slot value may
be the name (individual label) of a frame, a frame set turns into a frame net.
Like semantic nets, frame nets may be expressed in graphic and symbolic forms.

So, on the "system" level the semantic net presented above may be transformed
into the frame set (in a symbolic form)

(TYPE object index procedure sample declaration-line 52 type integer)
(REFERENCE object index line 56)
(CHARACTERISTICS procedure sample length-in-lines 8 commentedness 0 complexity 2).

A Prolog program consists of facts, rules and queries. Facts record the
existence of some relations between objects; rules state general dependencies
among relations and allow new facts to be derived from existing ones; queries
ask to confirm the existence of a relation between concrete objects or to point
out the objects connected by a certain relation with other objects. Objects may
be represented by lists of objects. For example, the frame set given above is
compactly represented by the Prolog facts (here and below the syntax and
terminology of [10] are used)

type  (index  sample  52  integer) 
reference  (index  56) 
characteristics  (sample  8  0  2), 

relationships  between  frames  are  set  by  Prolog  rules. 
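
The step from frames to compact facts amounts to dropping the slot names and keeping the slot values in order. The sketch below reproduces that step in Python for the three frames of the example; the data layout and the `frame_to_fact` helper are assumptions made for illustration, not the authors' implementation.

```python
# Each frame: a name plus an ordered mapping of slot names to slot values.
FRAMES = [
    ("TYPE", {"object": "index", "procedure": "sample",
              "declaration-line": 52, "type": "integer"}),
    ("REFERENCE", {"object": "index", "line": 56}),
    ("CHARACTERISTICS", {"procedure": "sample", "length-in-lines": 8,
                         "commentedness": 0, "complexity": 2}),
]


def frame_to_fact(name, slots):
    """Drop the slot names and keep the values positionally,
    mirroring the compact fact form (dicts preserve insertion order)."""
    return (name.lower(), *slots.values())


if __name__ == "__main__":
    for name, slots in FRAMES:
        print(frame_to_fact(name, slots))
```

The positional form is shorter, at the cost that the meaning of each argument is fixed by convention rather than stated in the fact itself.
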

Program testing and analysis actions are described by Prolog rules. Program
state models are analyzed by executing a Prolog program that includes an
appropriate set of rules and queries.

3  Contents  and  Ways  of  Forming  of  the  Model 

Apparently, in treating the modeling of program features it is necessary to
answer at least three important questions: WHAT, HOW and WHY must be
represented in the model? The model reflecting the features of some really
existing program is, nevertheless, a quite independent entity with its own
internal properties, physically separate from the program being modeled, and is
itself a possible object of investigation. The explanations of the general idea
of the approach given above mainly answer the question HOW (how are program
features represented in the model?). Now let us try to answer the question WHAT
(what program features must be represented in the model?).

The purpose of the model is to reflect program features adequately to the
requirements of analysis (which, as experience shows, tend to grow
permanently). For this reason, it has to contain the maximum possible (but at
the same time practically acceptable) volume of useful information about
various program features. Since Prolog allows new knowledge to be inferred from
existing knowledge, it is worthwhile to identify some "base" (necessary) volume
of knowledge about a program. In a certain sense the contents of the model are
a compromise between the requirements of analysis efficiency and memory
economy. Intuitively it is clear that, from the point of view of program
analysis and testing, the model must contain information about all program
modules and their relationships (such as calls and parameter passing), about
all places and contents of declarations and usages of program data objects,
about the control structure of the program, about the values of some metrics
(e.g. complexity), about the coverage of program paths by tests, etc.

Information about program module relationships, declarations and usages of data
objects, and control structure may be extracted from the program source code
during its syntactic parsing; the values of the necessary metrics are computed
on the basis of the parsing data; test run results are obtained by executing
the instrumented program.

4  Dependencies  between  Elements  of  the  Model 

To answer (naturally, not exhaustively) the question WHY (why must just this
kind of information be represented in the model?), let us make use of the
mathematical apparatus of relations, based on set theory.

Let A and B be two sets and $A \times B$ their Cartesian product; then any
subset $R \subseteq A \times B$ defines a binary relation between elements of A
and B. Objects (elements) x and y are in the binary relation R, denoted xRy, if
$(x, y) \in R$.

Let us take the following collection as the base set of relations:

$R_1$("begins-at") $\subseteq M \times P$,
$R_2$("terminates-at") $\subseteq M \times P$,
$R_3$("is-directly-nested-in") $\subseteq M \times M$,
$R_4$("is-called-at") $\subseteq M \times P$,
$R_5$("is-declared-at") $\subseteq V \times P$,
$R_6$("is-defined-at") $\subseteq V \times P$,
$R_7$("is-referred-to-at") $\subseteq V \times P$,
$R_8$("is-undefined-at") $\subseteq V \times P$,
$R_9$("starts-at") $\subseteq G \times P$,
$R_{10}$("ends-at") $\subseteq G \times P$,
$R_{11}$("is-directly-preceded-to") $\subseteq G \times G$,

where M, V, P, G are disjoint sets of model elements: M-elements represent
program modules (relatively independent program entities, such as head modules,
subprograms, procedures, functions, etc.), V-elements represent data objects
(e.g. variables and arrays) of a program, P-elements represent points
(placements) in a program, and G-elements represent c-graph nodes; the sense of
the relations is clear from their names. Without going into the concrete
details of how program objects are represented in the model, we confine
ourselves to requiring that this representation be unambiguous and that
P-elements be ordered so that they can be compared (before-after,
less-greater), which may be ensured by their numeric nature.

On the basis of the relations $R_1$ to $R_{11}$ and the common order relations
"not-greater-than" ($R_{12}$) and "not-less-than" ($R_{13}$), defined, in
particular, on the set $P \times P$, it is possible to define, by appropriate
operations on relations, some auxiliary relations that help to infer the
dependencies existing between model elements (and accordingly between program
objects). So, with the help of the relations

$R_{14}$("is-used-at") $\subseteq V \times P$,
$R_{15}$("is-nested-in") $\subseteq M \times M$,
$R_{16}$("belongs-to-the-module") $\subseteq P \times M$,
$R_{17}$("is-declared-in-the-module") $\subseteq V \times M$,
$R_{18}$("is-used-in-the-module") $\subseteq V \times M$,
$R_{19}$("is-known-in-the-module") $\subseteq V \times M$,

defined as

$R_{14} = R_6 \cup R_7 \cup R_8$,
$R_{15} = R_3 \cup R_3^{2} \cup R_3^{3} \cup \ldots = R_3^{+}$,
$R_{16} = ((R_{13} \circ R_1^{-1}) \cap (R_{12} \circ R_2^{-1})) \setminus (((R_{13} \circ R_1^{-1}) \cap (R_{12} \circ R_2^{-1})) \circ R_{15})$,
$R_{17} = R_5 \circ R_{16}$,
$R_{18} = R_{14} \circ R_{16}$,
$R_{19} = R_{17} \cup (R_{17} \circ R_{15}^{-1})$,

where $\cup$, $\cap$, $\setminus$ and $\circ$ mean union, intersection,
difference and composition (product) of relations, respectively, $R^{k}$ is the
kth power of a relation R, $R^{-1}$ is the inverse of the relation R, and
$R^{+}$ is the transitive closure of the relation R, the rule of the obligatory
declaration of data objects used within a program may be written as the
restriction $R_{18} \subseteq R_{19}$, and the rule of the obligatory usage of
data objects declared in a program as the restriction

$R_{17} \subseteq R_{18} \cup ((R_{18} \circ R_{15}) \setminus (R_{17} \circ R_{15}))$.
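
The operations on relations used above are easy to experiment with when a relation is stored as a set of pairs. The following is a minimal sketch under two assumptions: composition is read left to right (x is related to y if x is related to some z by the first relation and z to y by the second), and the module and variable names are hypothetical.

```python
def compose(r, s):
    """Composition: x (r o s) y iff there is a z with x r z and z s y."""
    return {(x, y) for (x, z1) in r for (z2, y) in s if z1 == z2}


def inverse(r):
    """Inverse relation: swap the two components of every pair."""
    return {(y, x) for (x, y) in r}


def transitive_closure(r):
    """Union of all positive powers of r, built up until nothing new appears."""
    closure = set(r)
    while True:
        extra = compose(closure, r) - closure
        if not extra:
            return closure
        closure |= extra


if __name__ == "__main__":
    # Hypothetical model: module "inner" inside "middle" inside "outer".
    directly_nested_in = {("inner", "middle"), ("middle", "outer")}
    nested_in = transitive_closure(directly_nested_in)
    declared_in = {("index", "outer")}
    # Known = declared here, or declared in a module that encloses this one.
    known_in = declared_in | compose(declared_in, inverse(nested_in))
    used_in = {("index", "inner"), ("ghost", "middle")}
    print(used_in - known_in)  # uses that violate "used implies known"
```

A restriction of the form "one relation is contained in another" then reduces to a set difference: any pair left over is a violation to report.
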

In a similar manner, some further useful auxiliary relations may be defined
through the relations of the base collection, in particular,

$R_{20}$("belongs-to-the-node") $\subseteq P \times G$,
$R_{21}$("precedes") $\subseteq G \times G$,
$R_{22}$("dominates-over") $\subseteq G \times G$,
$R_{23}$("belongs-to-a-node-preceding-to-a-node-containing-the-point") $\subseteq P \times P$,

and with their help it is possible to formulate some dependencies peculiar to
program control and data flow. So, the rule of the obligatory initialization of
referenced data objects may be expressed by the restrictions

$(R_{20}^{-1} \circ R_6^{-1} \circ R_7 \circ R_{20}) \cap R_{21} \subseteq R_{22}$,

$(R_8^{-1} \circ R_7) \cap R_{23} \subseteq ((R_8^{-1} \circ R_6) \cap R_{23}) \circ ((R_6^{-1} \circ R_7) \cap R_{23})$

(it is assumed that undefinitions of data objects form separate c-graph nodes),
and the sufficient condition of the nonredundancy of assignments by the
restriction

$(R_6^{-1} \circ (R_6 \cup R_8)) \cap R_{23} \subseteq ((R_6^{-1} \circ R_7) \cap R_{23}) \circ ((R_7^{-1} \circ (R_6 \cup R_8)) \cap R_{23})$

(in the present case it is assumed that the c-graph of a program consists of
statement nodes, not of block nodes).

In the same way, by introducing new sets of model elements reflecting the types
and sizes of program data objects, and by determining relations on them that
reflect the characteristics of program data objects and their use as formal and
actual parameters, one can formulate restrictions expressing the requirement
that the characteristics of some program elements correspond in certain
situations, in particular, that the type and length of formal and actual
parameters correspond, that the types of the left and right parts of an
assignment correspond, etc. As such sets one may suggest, say, C, F, E, T and
S, representing, respectively, constants, functions, expressions, and the
possible types and sizes of program elements, and as the relations on them
supplementing the base collection, the relations

$R_{24}$("has-type") $\subseteq (V \cup C \cup F \cup E) \times T$,
$R_{25}$("has-size") $\subseteq (V \cup C \cup F \cup E) \times S$,
$R_{26i}$("is-the-ith-actual-parameter-in-the-call-at") $\subseteq (V \cup C \cup F \cup E) \times P$,
$R_{27i}$("has-the-ith-formal-parameter") $\subseteq M \times V$.

In the context of the notions we have introduced, many kinds of static program
analysis are reduced to checking the correlations formulated above between
elements of the respective models.

Let us illustrate the possibility of a Prolog implementation of checking the
model restrictions we have introduced through the following example. Let the
relations of the base set, specifically $R_1$ to $R_{11}$, be determined in a
Prolog program (by setting the respective number of facts). Then, for instance,
the relations $R_{14}$ to $R_{19}$ are determined as:

x is-used-at y if (either x is-defined-at y
                   or (either x is-referred-to-at y
                       or x is-undefined-at y))

x is-nested-in y if (either x is-directly-nested-in y
                     or x is-directly-nested-in z and
                        z is-nested-in y)

x belongs-to-the-module y if y begins-at z and
                             x not-less-than z and
                             y terminates-at z1 and
                             x not-greater-than z1 and
                             not (y1 is-nested-in y and
                                  x belongs-to-the-module y1)

x is-declared-in-the-module y if x is-declared-at z and
                                 z belongs-to-the-module y

x is-used-in-the-module y if x is-used-at z and
                             z belongs-to-the-module y

x is-known-in-the-module y if (either x is-declared-in-the-module y
                               or x is-declared-in-the-module z and
                                  y is-nested-in z)

(the relations "not-greater-than" and "not-less-than" are determined in a
Prolog program by means of rules using the built-in arithmetic relation LESS,
for example:

x not-greater-than x
x not-greater-than y if x LESS y
x not-less-than x
x not-less-than y if y LESS x),




and checking of the restriction $R_{18} \subseteq R_{19}$, i.e. revealing the
model elements (and consequently the program elements) that do not satisfy this
restriction (in this case, program data objects that are used but not
declared), is implemented by the query

which (x : x is-used-in-the-module y and not x is-known-in-the-module y).
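
For readers without a micro-Prolog system, the same check can be sketched in Python. Everything in the block is hypothetical: the two modules, their line ranges and the variables are invented for illustration, and the helper names only echo the relation names used above.

```python
# Hypothetical model: module "inner" (lines 5-10) nested in "outer" (lines 1-20).
BEGINS = {"outer": 1, "inner": 5}
ENDS = {"outer": 20, "inner": 10}
DIRECTLY_NESTED = {"inner": "outer"}
DECLARED_AT = {("index", 2)}
USED_AT = {("index", 7), ("ghost", 12)}


def module_of(point):
    """Innermost module whose line range contains the point
    (for properly nested ranges, the one that begins last)."""
    candidates = [m for m in BEGINS if BEGINS[m] <= point <= ENDS[m]]
    return max(candidates, key=lambda m: BEGINS[m], default=None)


def enclosing(module):
    """Modules that the given module is nested in, innermost first."""
    chain = []
    while module in DIRECTLY_NESTED:
        module = DIRECTLY_NESTED[module]
        chain.append(module)
    return chain


def undeclared_uses():
    """Pairs (variable, module) that are used in the module but not known
    there, i.e. declared neither in it nor in any enclosing module."""
    declared_in = {(v, module_of(p)) for v, p in DECLARED_AT}
    violations = set()
    for v, p in USED_AT:
        m = module_of(p)
        if not any((v, scope) in declared_in for scope in [m] + enclosing(m)):
            violations.add((v, m))
    return violations


if __name__ == "__main__":
    print(undeclared_uses())
```

Here the use of index inside the nested module is accepted because the declaration in the enclosing module is visible, while ghost is reported as used but never declared.
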

5  SAIL  System 

The approach considered is the basis of static and dynamic program analysis in
the SAIL system [11], which analyses programs written in Fortran-77 and
consists of three principal parts:

1) a Fortran-program analyzer, which performs restricted parsing, construction
of the c-graph, instrumentation, extraction of information from specially
organized introductory comments, and measurement of cyclomatic complexity [12]
and commentedness, and on this basis constructs the state net model of the
program analyzed (in fact, it is a translator of syntactically correct Fortran
programs into Prolog models);

2) an analyzer of program state net models, which solves different tasks in
program analysis;

3) a monitor, which mediates between the user and the above analyzers and
provides the user-friendliness of the system.
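
As a brief illustration of one measurement performed by the analyzer, cyclomatic complexity can be computed from the c-graph as the number of edges minus the number of nodes plus two. The sketch assumes a single connected c-graph with one entry and one exit; the function name and the example graphs are hypothetical and are not taken from SAIL.

```python
def cyclomatic_complexity(edges):
    """Cyclomatic number of one connected control flow graph:
    edges - nodes + 2 (single entry, single exit assumed)."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) - len(nodes) + 2


if __name__ == "__main__":
    # Hypothetical c-graph of an if-then-else: node 1 branches to 2 or 3, both reach 4.
    if_then_else = [(1, 2), (1, 3), (2, 4), (3, 4)]
    print(cyclomatic_complexity(if_then_else))
```

A straight-line sequence has complexity 1, and every additional binary decision raises the value by one.
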

Micro-Prolog [10] is the language used to represent and analyze the state
models of Fortran programs and to implement the monitor, and Turbo-Pascal is
the language used to implement the Fortran-program analyzer.

The implemented version of the SAIL system detects infeasible parts of code and
latent cycles, reveals the usage of uninitialized variables and redundant
assignments, checks that declared variables are used and that used variables
are declared, reveals untested program parts and proposes available plans of
testing, and also provides the user with various general information about the
program and its characteristics. So, it allows users to get data about size,
number of entries and exits, complexity and commentedness by means of requests,
including complex conjunctive ones that set different combinations of program
characteristics. An example of a conjunctive request, formed by the user by
selecting and refining the corresponding menu lines, is "Point the program
module, which has length more than 100, complexity more than 20 and
commentedness less than 10".
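
A conjunctive request of this kind is a filter in which every condition must hold at once. The sketch below shows the idea in Python; the module names and metric values are invented for illustration, and the condition format is an assumption rather than the SAIL menu structure.

```python
import operator

# Hypothetical characteristics of three program modules.
MODULES = {
    "alpha": {"length": 140, "complexity": 25, "commentedness": 5},
    "beta": {"length": 160, "complexity": 12, "commentedness": 4},
    "gamma": {"length": 60, "complexity": 30, "commentedness": 8},
}

COMPARISONS = {"more than": operator.gt, "less than": operator.lt}


def select(modules, conditions):
    """Names of the modules that satisfy every (metric, comparison, bound) condition."""
    return sorted(
        name
        for name, metrics in modules.items()
        if all(COMPARISONS[cmp](metrics[key], bound) for key, cmp, bound in conditions)
    )


if __name__ == "__main__":
    request = [
        ("length", "more than", 100),
        ("complexity", "more than", 20),
        ("commentedness", "less than", 10),
    ]
    print(select(MODULES, request))
```

Dropping a condition from the request widens the answer, which is how a user can refine a query step by step.
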

Besides, a user of SAIL can ask the system different questions about a program
(in Russian), with simple or complex (conjunctive) conditions. The question
form is fixed, though it allows some "liberties" (e.g. commas, prepositions,
inflexional endings), which bring it closer to natural language. This
possibility, which may make the task of program maintenance easier, is based on
the user's "understanding" of the program representation at his level and is
ensured by the simplicity of this representation, by its likeness to natural
language clauses, and by the ease of implementing the parsing of such clauses
in Prolog.

Examples of questions with simple conditions are: What is r1? What type
does every variable have? Which module uses the variable index? What modules
use arrays? Which variables are declared implicitly? In which line is every
array declared?

Examples of questions with complex conditions are: What type and length
does the variable index have? What type, length and dimension does every
array have? What module calls module s1 and is called by module s2? What
variable has type real, length 4 and belongs to a common block?

The form and contents of the answers are also oriented towards supporting a
natural language dialogue. Thus the answer to a question similar to the first
of those given above will contain complete information about the program
object mentioned, for example:


r1 is the variable of the module def, which
    is a formal parameter,
    is declared implicitly,
    has the type real,
    has length (in bytes) 4,
    is defined in lines 4 14 20,
    is referred to in lines 21 22,
r1 is the variable of the module quad, which
    is declared in line 40,
    has the type integer.

and  the  answer  to  the  second  question  may  be  the  following: 

in  the  module  def: 

yes  —  character, 
i  —  integer, 
r1 — real,
r2  —  real. 
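
To illustrate how such answers could be derived from a store of simple facts, the sketch below holds facts as tuples and answers the question "What type does every variable have?" by pattern matching. It is written in Python rather than Prolog, and the tuple layout, the function names `ask` and `types_by_module`, and the selection of facts (taken from the sample answers above) are assumptions, not the actual SAIL model.

```python
# Facts of a hypothetical semantic net: (module, object, attribute, value).
FACTS = [
    ("def", "yes", "type", "character"),
    ("def", "i", "type", "integer"),
    ("def", "r1", "type", "real"),
    ("def", "r2", "type", "real"),
    ("def", "r1", "length", 4),
    ("quad", "r1", "type", "integer"),
]


def ask(facts, module=None, obj=None, attribute=None):
    """Return all facts matching the given fields; None acts as a wildcard."""
    pattern = (module, obj, attribute)
    return [
        fact for fact in facts
        if all(p is None or p == v for p, v in zip(pattern, fact))
    ]


def types_by_module(facts):
    """Answer 'What type does every variable have?', grouped by module."""
    answer = {}
    for module, obj, _, value in ask(facts, attribute="type"):
        answer.setdefault(module, {})[obj] = value
    return answer


for module, variables in types_by_module(FACTS).items():
    print(f"in the module {module}:")
    for name, type_name in variables.items():
        print(f"    {name} - {type_name},")
```

A conjunctive question, such as asking for variables of type real and length 4, would correspond to intersecting the results of two such `ask` calls.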

While the user types a question, the system offers a number of convenient
prompts, and while answering requests it narrows the search area, allowing
the user either to choose the names of the modules of interest from a
submenu or to specify a search over all program modules.

6  Possible  Fields  of  the  Application 

Fortran, due to its specific features, is the language traditionally selected
to illustrate the possibilities of program analysis systems [13], but most of
the kinds of analysis implemented in the SAIL system are common to a
sufficiently large number of programming languages, first of all to those of
a procedural type [14]. For this reason, many of the statements made above
also hold for the analysis of programs written in other languages.

The program life cycle usually includes specification, design, coding, testing
and maintenance. The principles of the approach, presented above as applied to
program code analysis, extend naturally to the analysis of the other
representations (specification, design) used during the program life cycle.
Moreover, with models of such representations it is also possible to analyze
the interrelations between program representations at different stages of the
life cycle, in order to keep them co-ordinated when, for example, some of them
are modified.

The main principles of this approach, which are based on constructing a
semantic net model of the knowledge domain and on providing an intelligent
interface to that model, can be used to organize support for various
processes that accompany program creation and development [15]. Such processes
include program development control, program configuration control, etc.

7  Conclusions  and  Future  Work 

An approach to organizing computer support for program testing and analysis
has been considered.

Main  advantages  of  the  offered  approach  are: 

1) reflection of various program features in a uniform model and the diversity
of supported kinds of analysis (including those common to a sufficiently large
set of programming languages);

2) the possibility of extending the main ideas of the approach to other
programming languages (by developing appropriate program analyzers), to other
kinds of analysis (specific to each language), and also to the analysis of
program representations other than program code (i.e. specification, design);

3) the possibility, ensured by the use of Prolog, of inferring new knowledge
about a program from the basic knowledge represented in the model, which
allows, on the one hand, the model size to be decreased (to some necessary
minimum) and, on the other hand, new kinds of analysis to be added without
extending the model, in particular new quality appraisals to be obtained when
additional criteria are adopted;

4) the possibility of organizing a question-answering dialogue between user
and system in a subset of natural language, based on the simplicity of
knowledge representation at the "user" level, the similarity of semantic net
facts to natural language sentences and the well-known ability of Prolog to
parse such sentences.

Work on the SAIL project is still in progress. The main directions for the
further development of SAIL are:

1) increasing the number of supported programming languages (by developing
appropriate source code analyzers and refining the set of analysis kinds for
each particular language);

2) increasing the number of tasks solved (by extending the set of analysis
kinds, testing criteria and coding style metrics used);

3) increasing the number of languages for communication with the system;

4) splitting the development of the system into two directions: on the one
hand, as an analysis, testing and maintenance tool for programmers and, on
the other hand, as a means of teaching students the basic technological
aspects of high quality program development.




References 

1. Ince DC. The provision of procedural and functional interfaces for the
maintenance of program design and program language notations. SIGPLAN Not
1984; 19, 2: 68-74.

2. Woodman M. A program design language for software engineering. SIGPLAN
Not 1984; 19, 8: 109-118.

3. Ince DC. A program design language based software maintenance tool.
Softw Pract Exper 1985; 15, 6: 583-594.

4. Ince DC, Woodman M. The rapid generation of a class of software tools.
Comput J 1986; 29, 2: 151-160.

5. Meier B. The software knowledge base. In: 8th Int. Conf. Softw. Eng.,
London, August 28-30, 1985. Proc. Washington, 1985, pp 158-165.

6. Ince DC. Module interconnection language and Prolog. SIGPLAN Not
1984; 19, 8: 89-93.

7. Leung CHC, Choo QH. A knowledge-base for effective software specification
and maintenance. In: 3rd Int. Workshop Softw. Specif. and Des.,
London, Aug. 16-17, 1985, pp 139-142.

8. Yau SS, Nicholl RA, Tsai JJ, Liu SS. An integrated life-cycle model for
software maintenance. IEEE Trans Softw Eng 1988; 14, 8: 1128-1144.

9. Galkin IM. Net modeling, static and dynamic program analysis. Prepr.
No. 5(455), Minsk, The Inst. of Mathematics of Byelorussian Academy of
Sciences, 1991; in Russian.

10. Clark KL, McCabe FG. Micro-Prolog: programming in logic. Prentice-Hall,
1984.

11. Galkin IM. Semantic nets in program analysis. In: Mixed computations
and transformation. Novosibirsk, 1991, pp 112-120; in Russian.

12. McCabe TJ. A complexity measure. IEEE Trans Softw Eng 1976; SE-2,
4: 308-320.

13. DeMillo RA, McCracken WM, Martin RJ, Passafiume JF. Software testing
and evaluation. Menlo Park, 1987.

14. Galkin IM. Usage of semantic nets in a process of program making and
maintenance. Prepr. No. 32(432), Minsk, The Inst. of Mathematics of
Byelorussian Academy of Sciences, 1990; in Russian.

15. Galkin IM. Usage of semantic nets for program modeling and analysis.
USiM (Control Systems and Machines) 1991; 5: 55-61; in Russian.


Session  7 


DEPENDABLE 

SOFTWARE 

Chair:  W.  Ehrenberger 
GRS  Forschungsgelande,  Garching,  D 


Robust Requirements Specifications for
Safety-Critical Systems

Amer Saeed, Rogerio de Lemos and Tom Anderson
Department of Computing Science, University of Newcastle
Newcastle upon Tyne, NE1 7RU, UK

Abstract 

Experience  in  safety-critical  systems  has  shown  that  deviations 
from  assumed  behaviour  can  and  do  cause  accidents.  This 
suggests  that  the  development  of  requirements  specifications  for 
such  systems  should  be  supported  with  a  risk  analysis.  In  this 
paper  we  present  an  approach  to  the  development  of  robust 
requirements  specifications  (i.e.  specifications  that  are  adequate 
for  the  risks  involved),  based  on  qualitative  and  quantitative 
analyses. 

1  Introduction 

During  software  development,  the  phase  of  requirements  analysis  provides  the 
system  context  in  which  the  software  requirements  must  be  considered.  This  is  a 
fundamental issue for safety-critical systems because "safety" is essentially an
attribute of the system rather than just of the software. The work in this paper
enhances a methodology for the requirements analysis of safety-critical process
control systems [1] by incorporating techniques for the production of robust requirements
specifications,  and  by  providing  means  to  evaluate  these  specifications  against  the 
system  risks.  A  robust  requirements  specification  is  constructed  by  modifying  a 
specification  to  take  into  account  violations  in  the  assumptions  upon  which  the 
specification  is  based,  and  the  possibility  of  specifications  being  violated  due  to 
faults  that  might  be  introduced  during  later  stages  of  software  development. 
System  risk  is  related  to  the  likelihood  of  a  system  entering  into  a  hazard  state,  the 
likelihood  that  the  hazard  will  lead  to  an  accident,  and  the  expected  potential  loss 
associated  with  such  an  accident  [2]. 

Robust  requirements  specifications  are  obtained  by  conducting  qualitative  and 
quantitative  analysis  of  the  requirements.  Analysis  aims  to  provide  confidence 
that  the  level  of  risk  is  acceptable.  The  qualitative  analysis  seeks  to  identify  those 
circumstances  that  can  lead  to  violations  of  a  specification,  and  subsequently  take 
the  system  into  a  hazard  state.  The  quantitative  analysis  attaches  probabilities  to 
the  occurrence  of  the  identified  circumstances,  in  order  to  estimate  the  risk 
associated with a specification. The risk estimates provide the basis for
conducting risk assessments, which compare alternative specifications and judge
whether the risk is acceptable. For the process of requirements analysis we adopt
the approach of analysing the system from different perspectives and using
different techniques [3]. This approach enables the extraction of different (and
complementary) information concerning the robustness of the requirements
specifications.

In summary, the enhancements proposed in this paper for the basic
methodology are as follows. For each level of abstraction at which the analysis is
performed, the assumptions are identified and recorded, and the fault analysis of
the specifications is conducted, with the aim of analysing the circumstances in
which the specifications are unable to maintain safe behaviour from the system.
In other words, apart from checking how good the specifications are, the aim is to
identify their weaknesses and modify the specifications in order to make them
more robust.

The  methodology  and  its  enhancements  will  be  presented  as  follows.  The  next 
section  describes  a  methodology  for  requirements  analysis.  Section  3  describes 
how  quality  can  be  attained,  in  terms  of  risk,  by  performing  qualitative  and 
quantitative  analysis  using  viewpoints.  Finally,  section  4  contributes  some 
concluding  remarks. 

2  A  Methodology  for  Requirements  Analysis 

In this section we give an overview of a methodology for requirements analysis; a
more detailed discussion is given elsewhere [1]. The methodology consists of a
framework with distinct phases of analysis, a graph that depicts the relationship
between the specifications produced during the analysis, and a set of formal
techniques appropriate for the issues to be analysed at each phase.

2.1  Framework  for  Requirements  Analysis 

The framework adopts the approach of separating the mission from the safety
requirements during an initial phase, and then partitioning the analysis of the
safety requirements into distinct phases. Each phase of analysis is focused on a
specific domain, where the identification of the relevant domains follows directly
from the components (i.e. operator, plant and controller) of a general structure
for safety-critical systems, and the relationship between the phases is dictated by
the interactions between these components. The analysis of the phases will take
into account non-standard behaviours of the entities of a domain; a basis for the
analysis is provided by establishing the standard, exceptional and failure
behaviours of the entities [4].

•  Conceptual  Analysis.  The  objective  of  this  phase  is  to  produce  an  initial 
statement  of  the  aim  and  purpose  of  the  system  and  determine  those  failure 
behaviours  of  the  system  which  constitute  accidents.  As  a  product  of  this 
phase  we  obtain  the  Safety  Requirements,  enumerating  the  accidents.  The 
accidents are the basis for separating mission from safety issues. Another
activity to be performed during this phase is the identification of the modes
of operation of the system; these are classes of states that group together
related operational functions.

•  Safety  Plant  Analysis.  During  this  phase  the  plant  properties  relevant  to  the 
Safety  Requirements,  such  as  the  physical  laws  and  rules  of  operation  that 
govern  plant  behaviour  and  potential  hazards,  are  identified.  The  outcome 
is  the  Safety  Plant  Specification  which  contains  safety  constraints  (conditions 
over  the  physical  process  that  are  the  negations  of  hazards  modified  to 
incorporate  safety  margins)  and  safety  strategies  (schemes  to  maintain  safety 
constraints  defined  as  a  set  of  conditions,  in  terms  of  controllable  factors, 
over  the  physical  process). 

•  Safety  Interface  Analysis.  The  objective  of  this  phase  is  to  delineate  the 
plant  interface,  and  specify  the  behaviour  that  must  be  exhibited  at  that 
interface.  This  phase  leads  to  the  production  of  the  Safety  Interface 
Specification,  containing  the  interface  safety  strategies  (refinements  of  safety 
strategies,  incorporating  properties  of  sensors  and  actuators). 

•  Safety Control System Analysis. During this phase we establish a top level
organization  for  the  control  system  in  terms  of  the  properties  of  its 
components,  and  their  interactions.  This  phase  leads  to  the  production  of 
the  Safety  Control  System  Specification,  containing  the  control  system  safety 
strategies  (refinements  of  interface  safety  strategies  incorporating  the 
components  of  the  control  system). 

2.2  Safety  Specification  Graph 

The specifications produced at the different phases of the requirements analysis
are organized into a Safety Specification Graph (SSG). The structure embodied in
the modes of operation can be reflected in the organization of the requirements
specifications by constructing a separate SSG for each mode. An SSG is a
directed acyclic graph, in which the vertices represent the safety specifications
(requirements specifications for safety) and the edges denote relationships
between the specifications. For a system with p accidents, the SSG consists of p
component graphs. Each component graph is an evolutionary graph [5]; the
evolution is related to the phases of the framework. At each phase a set of new
specifications is added to the graph of the previous phases, by connecting the
specifications to the terminal vertices (representing the specifications of the
previous phase) of the graph.

On completion (see figure 1) the top element of each component graph is an
accident (denoted by $AC_i$) and is related to a set of hazards ($HZ_{i,j}$) that
can lead to it. Each hazard is related to the safety constraint ($SC_{i,j}$) that
negates the hazard, and each safety constraint is related to the safety strategies
($SS_{i,j,k}$) that maintain the constraint. Then the safety strategies are related
to their refinements into interface safety strategies ($ISS_{i,j,k,l}$), and a
similar relation is depicted for control system strategies ($CSS_{i,j,k,l,m}$).
When more than one strategy is related to a specification of a previous level,
either the strategies are exclusive and a choice has to be made in later stages of
development to implement a single strategy, or the strategies complement each
other and all are needed to attain the confidence required for the risk involved.

Figure 1. Example safety specification graph

The SSG of a system provides support in conducting a qualitative analysis and
provides the basis for a systematic approach to the modification of the
specifications. For the qualitative analysis, support is provided by establishing the
conditions (which follow from the edges of the component graphs) that must be
confirmed to ensure that the specifications maintain safe behaviour. A key
concern in modification is traceability, that is the ability to trace back from a
strategy to its origins and to trace forward to the strategies which are derived
from the strategy. Support for traceability is provided by constructing reachability
and adjacency matrices for the SSG. These matrices enable the localization of the
side-effects of a modification and identification of the relationships that must be
reconfirmed, thereby increasing the assurance that when changes are necessary
they will be complete and consistent.
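
The use of adjacency and reachability matrices for traceability can be illustrated as follows. This is only a sketch under stated assumptions: the vertex names and edges form an invented fragment of one component graph, and the reachability matrix is computed with Warshall's transitive closure algorithm, which the paper does not prescribe.

```python
def adjacency_matrix(vertices, edges):
    """Build a boolean adjacency matrix for a directed graph."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    adj = [[False] * n for _ in range(n)]
    for src, dst in edges:
        adj[index[src]][index[dst]] = True
    return adj


def reachability_matrix(adj):
    """Transitive closure of the adjacency matrix (Warshall's algorithm)."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def trace(vertices, reach, vertex):
    """Return (origins, derived): what a vertex traces back and forward to."""
    i = vertices.index(vertex)
    origins = [v for j, v in enumerate(vertices) if reach[j][i]]
    derived = [v for j, v in enumerate(vertices) if reach[i][j]]
    return origins, derived


# Hypothetical fragment of a component graph (names follow the SSG labels).
VERTICES = ["AC1", "HZ1,1", "SC1,1", "SS1,1,1", "SS1,1,2", "ISS1,1,1,1"]
EDGES = [
    ("AC1", "HZ1,1"),
    ("HZ1,1", "SC1,1"),
    ("SC1,1", "SS1,1,1"),
    ("SC1,1", "SS1,1,2"),
    ("SS1,1,1", "ISS1,1,1,1"),
]

REACH = reachability_matrix(adjacency_matrix(VERTICES, EDGES))
origins, derived = trace(VERTICES, REACH, "SS1,1,1")
print("traces back to:", origins)
print("traces forward to:", derived)
```

When a strategy is modified, the column of the reachability matrix gives the specifications whose relationships with it may need to be reconfirmed, and the row gives the refinements that may be affected.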

2.3  Techniques  for  the  Framework

For  the  application  of  formal  notations  and  techniques,  the  approach  adopted  is 
to  employ  notations  in  accordance  with  the  characteristics  of  the  system  to  be 
analysed  during  the  different  domains  of  analysis.  Within  the  context  of  the 
framework,  the  relevant  formalisms  are  grouped  into  two  classes:  descriptive  and 
operational.  A  descriptive  formalism  specifies  the  behaviour  of  a  domain  in  terms 
of  axioms  (representing  system  properties)  over  a  model  of  the  domain,  whereas 
an  operational  formalism  is  used  to  model  the  activities  and  interactions  between 
the entities of a domain. Real Time Temporal Logic and Timed History Logic are
examples of descriptive formalisms, and Statecharts and Predicate-Transition
Nets are examples of operational formalisms. The extent to which each class of
formalism is applied in a specific domain depends on the level of abstraction at
which the domain resides. At higher levels, descriptive formalisms should be
more prominent; at the lower levels, however, operational formalisms become
increasingly relevant.

In  order  to  describe  the  behaviour  of  systems  at  different  levels  of  abstraction,  we 
adopt  an  event/action  model  (E/A  model)  [1].  The  main  features  of  the  E/A 
model  are  that  its  primitive  concepts  (events,  actions,  states  and  the  concept  of  a 
time  line)  can  be  expressed  in  both  descriptive  and  operational  formalisms,  and  it 
supports  both  discrete  and  dense  time  structures.  When  employed  in  the 
framework, the models of system behaviour, constructed at the different phases,
are built on top of a common foundation, providing support for verification
between the different levels of abstraction.

3  Quality  Analysis  of  Safety  Specifications 

One important factor in determining the quality of the specifications for
safety-critical systems is the risk analysis of the safety specifications; this aims to
determine if the contribution of the software to the overall system risk is
acceptable. In order to achieve this aim, a bridge has to be established between
the risk analysis of the system and the software. Within the context of the
methodology, this bridge is established through the SSG by relating the system
requirements to the software requirements. To perform the risk analysis, those
circumstances  which  can  violate  a  specification,  and  cause  the  system  to  enter 
into  a  hazard  state,  have  to  be  identified  and  their  probability  of  occurring 
calculated.  Once  the  risk  is  quantified  we  are  able  to  judge  whether  the  risk 
associated  with  a  specification  is  acceptable  or  not  (risk  assessment).  If  not,  the 
specification  has  to  be  modified  or  combined  with  other  specifications  in  order  to 
reduce  the  risk.  As  a  result,  we  obtain  a  robust  safety  specification  which  is  a 
specification  that  can  be  violated  only  within  an  acceptable  risk.  It  should  be 
noted  that  the  risk  analysis  presented  in  this  section  does  not  take  into  account 
the  consequences  of  an  accident. 

During the operation of the system, the occurrence of an initiating event (an
event which can lead the system into a hazard state) of an accident sequence [6]
distinguishes two kinds of system state: safe and unsafe. An unsafe state is a
state which could lead the system into a hazard state in the absence of corrective
action and in the absence of subsequent initiating events. If a state is not an
unsafe state then it is said to be safe. These definitions ensure that a hazard
cannot occur subsequent to a safe state if no initiating event occurs. In terms of
the  requirements  specifications,  the  concept  of  initiating  event  refers  to  those 
circumstances  which  can  lead  to  the  violation  of  a  safety  specification. 

The quality analysis of the requirements, in each domain of analysis, is performed
from two different perspectives: qualitative and quantitative. The purpose of the
qualitative analysis is to identify those circumstances which can violate a
specification, and analyse the impact of these violations upon the safety of the
system. These circumstances are related to the violation of assumptions upon
which a specification is based and to the violation of certain conditions of a
specification. The quantitative analysis complements the qualitative analysis by
attaching  occurrence  probabilities  to  these  circumstances.  In  order  to  ensure  that 
essential  system  behaviour  is  not  precluded,  the  restrictions  that  a  safety 
specification  will  impose  on  the  mission  must  also  be  considered. 

3.1  Qualitative  Risk  Analysis

At  each  level  of  abstraction,  analysis  is  conducted  over  safety  specifications 
(descriptions  of  safe  behaviour  at  the  level)  and  assumptions  (properties 
assumed at the level). In the proposed approach the qualitative analysis is
conducted in two stages: first the preliminary analysis and then the
vulnerability analysis of the safety specifications.

3.1.1  Preliminary  Analysis 

In  this  paper,  we  consider  the  preliminary  analysis  to  be  the  analysis  that  must  be 
conducted prior to the risk analysis. This analysis involves confirming that the
specifications at a particular layer of the SSG comply with those of the layer that
precedes it, and that the specifications within a layer are consistent. The relationships
that  must  be  confirmed,  to  ensure  compliance  between  the  layers,  follow  from  the 
edges  of  the  SSG.  Demonstrating  compliance  between  the  layers  involves 
employing  both  verification  (formal  analysis)  and  validation  (informal  analysis) 
techniques.  The  hazards  are  validated  against  the  accidents  and  the  safety 
constraints  are  verified  against  the  negation  of  the  hazards.  Subsequently  the 
strategies  are  verified  against  the  specifications  of  the  previous  layer.  At  each 
layer,  any  assumptions  required  to  confirm  the  relationships,  depicted  by  the 
edges of the SSG, are recorded. As an example of the relations that must be
verified, we examine the edge (from the SSG in figure 1) that connects the safety
constraint $SC_{1,1}$ to the safety strategy $SS_{1,1,1}$. Let us suppose that the
strategy is based upon assumption $A$ (which represents a property of the physical
process); the relationship to be confirmed is then:

$$A \land SS_{1,1,1} \Rightarrow SC_{1,1} \qquad (f1)$$

A  result  of  the  preliminary  analysis  is  that  the  circumstances  under  which  safe 
behaviour  is  maintained,  are  clearly  scoped  and  organised  in  accordance  with 
their contribution to each phase of the analysis. This activity ensures that the
knowledge gained during the development and validation/verification of the
safety specifications can be applied effectively during the risk analysis.

3.1.2  Vulnerability  Analysis

After performing the preliminary analysis of the safety specifications, the
qualitative risk analysis consists of performing the vulnerability analysis of the
specifications, which probes the safety specification and associated assumptions
to identify the circumstances under which the specification is unable to maintain
safe behaviour, i.e. the violation of a specification. Once these circumstances are
identified, the safety specifications can be modified to become more robust
against possible violations. An initial step in the vulnerability analysis is to negate
the relationships obtained during the preliminary analysis and to identify the
system states which can lead to the violation of the specification when the above
circumstances occur.

For the relationship f1, the logical assertion is negated and those plant states
($PS$) which can lead to the violation of $SC_{1,1}$ are identified:

$$\lnot SC_{1,1} \Leftarrow (\lnot SS_{1,1,1} \land PS) \lor (\lnot A \land PS) \qquad (f2)$$

For this relationship, which is associated with the plant level, the subsequent
vulnerability analysis of the safety strategy $SS_{1,1,1}$ will identify those
conditions that can lead to the violation of $SC_{1,1}$.
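
The negation step can be checked mechanically for the propositional core of the relationship. The sketch below enumerates all truth assignments and confirms that the implication from assumption and strategy to constraint is logically equivalent to its contrapositive: whenever the constraint is violated, the strategy or the assumption must have been violated. It deliberately leaves out the plant states PS, which the paper introduces as additional domain knowledge, and the function names are chosen for illustration only.

```python
from itertools import product


def implies(p, q):
    """Material implication p => q."""
    return (not p) or q


def f1(a, ss, sc):
    """Assumption and strategy together imply the safety constraint."""
    return implies(a and ss, sc)


def contrapositive(a, ss, sc):
    """A violated constraint implies a violated strategy or assumption."""
    return implies(not sc, (not ss) or (not a))


# Compare the two formulas on all eight truth assignments.
equivalent = all(
    f1(a, ss, sc) == contrapositive(a, ss, sc)
    for a, ss, sc in product([False, True], repeat=3)
)
print("f1 equivalent to its contrapositive:", equivalent)
```

This is why the vulnerability analysis can concentrate on two kinds of cause for a violated constraint: violation of the strategy and violation of the assumption.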

Although  logical  formulae  are  useful  in  obtaining  a  high-level  view  of  the 
relationship  between  the  specifications  and  assumptions,  such  formulae  provide 
limited  support  for  a  failure  analysis.  A  suitable  representation,  for  such  analysis, 
is  one  which  supports  the  identification  of  possible  failure  behaviours  that  can 
lead  to  the  identified  hazardous  states.  In  this  paper,  to  perform  the  vulnerability 
analysis  of  the  safety  specifications,  we  employ  fault  tree  analysis  (FTA)  [7]  which 
has  been  used  extensively  in  the  analysis  of  system  safety  and  more  recently  in  the 
analysis  of  software  safety  [8].  A  key  feature  of  fault  tree  analysis  that  makes  it 
suitable  for  the  analysis  to  be  conducted  here,  is  that  the  analysis  is  restricted  to 
the  identification  of  system  components  and  conditions  that  lead  to  one 
particular  undesired  system  state. 

To construct a fault tree for the relationship f2, the initial step is to identify
the undesired state, in this case the negation of the safety constraint $SC_{1,1}$,
and then to determine the set of possible causes which can lead to the undesired
state (refer to figure 2). For the logical formula f2, we identify the violation of
the assumption and the violation of the safety strategy $SS_{1,1,1}$. The latter
has to be further refined in order to identify its primary events.

Qualitative  risk  analysis  provides  a  basis  for  obtaining  more  robust  safety 
specifications  which  will  lead  to  a  risk  abatement  of  the  overall  system.  In  the 
approach  adopted,  the  analysis  is  performed  by  employing  both  formal  analysis 
and  fault  tree  analysis  in  order  to  determine  the  weaknesses  of  the  safety 
specifications.  Once  these  weaknesses  are  identified  the  safety  specifications  can 
be  modified  to  incorporate  mechanisms  which  aim  to  reduce  their  vulnerability. 
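
One common way to expose weaknesses from a fault tree is to list its minimal cut sets, i.e. the smallest combinations of primary events that are sufficient to cause the undesired state. The sketch below is illustrative only: the tree structure mirrors the top of the tree described above, but the refinement of the strategy violation into a sensor fault, a controller fault and a monitor fault is invented for the example and does not come from the paper.

```python
def cut_sets(node):
    """Return the cut sets of a fault tree.

    A node is either a string (a primary event) or a tuple
    (gate, children) where gate is "OR" or "AND".
    """
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child causes the event.
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":
        # One cut set from every child is needed at the same time.
        result = [frozenset()]
        for sets in child_sets:
            result = [a | b for a in result for b in sets]
        return result
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Drop every cut set that strictly contains another one."""
    unique = set(cut_sets(node))
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# Top event: violation of the safety constraint (hypothetical refinement).
TREE = ("OR", [
    "assumption A violated",
    ("OR", [                      # violation of the safety strategy
        "sensor fault",
        ("AND", ["controller fault", "monitor fault"]),
    ]),
])

for cut in minimal_cut_sets(TREE):
    print(sorted(cut))
```

Single-event cut sets point to the places where a specification is most vulnerable and where an additional mechanism would reduce the vulnerability.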

3.2  Quantitative  Risk  Analysis 

In this section we discuss how a quantitative analysis complements the qualitative
analysis by introducing a measurement of confidence in the quality of the safety
specifications. While the latter identifies circumstances which can lead to the
violation of the specifications, the former associates probabilities with these
circumstances.

Figure 2. Fault tree for formula f2

Although  the  qualitative  approach  strives  to  achieve  total  assurance  for  the  safety 
specifications,  there  are  three  basic  limitations  which  indicate  that  this  aim  may 
not  be  realised.  A  first  limitation  stems  from  the  process  of  capturing  user 
requirements:  some  faults  introduced  during  the  requirements  stage  may  not  be 
removed  during  the  verification  process,  nor  can  it  be  guaranteed  that  all  such 
faults  will  be  removed  during  validation.  A  second  limitation  arises  from 
observing  past  experience  in  the  utilization  of  formal  techniques,  which  shows 
that  a  formal  verification  may  itself  contain  faults  [9].  The  third  limitation  is 
related  to  the  confidence  that  can  be  placed  on  the  assumptions  upon  which  a 
specification  is  based.  From  these  limitations  we  infer  that  even  after  performing 
the  qualitative  analysis  we  are  still  faced  with  uncertainties  concerning  the  quality 
of  the  safety  specifications,  hence  the  necessity  to  quantify  the  uncertainties  in 
order  to  obtain  a  level  of  confidence  in  the  quality  of  the  safety  specifications.  In 
other  words,  the  aim  is  to  obtain  an  early  prediction  of  the  contribution  of  the 
software  to  the  risk  of  the  system. 

To associate occurrence probabilities to those circumstances which can lead to
the violation of the specification, such as plant states and violation of
assumptions, might not be a difficult task. On the other hand, associating
occurrence probabilities to the violation of certain conditions that depend on a
software implementation is more problematic because during the requirements
phase of software development sufficient design and implementation
information is not yet available. Instead of estimating the probability of a
condition being violated, target probabilities demanded from the higher level
safety specifications, such as hazards, can be used. However, once a specification
is sufficiently detailed, currently available techniques which attempt to make
early predictions of the software reliability can be used, such as: metrics [10],
product-in-a-process [11], and execution of the specifications [12].
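
As a minimal sketch of how occurrence probabilities might be propagated through a fault tree, the following Python fragment combines basic events through AND and OR gates. It assumes statistically independent events, and the tree shape, event names and probability values are hypothetical; they are not taken from the paper.

```python
from math import prod


def p_and(probabilities):
    """AND gate: every independent input event must occur."""
    return prod(probabilities)


def p_or(probabilities):
    """OR gate: at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical values, for illustration only.
p_plant_state = 1e-2          # plant is in the critical state
p_assumption_violated = 1e-3  # an assumption of the specification fails
p_condition_target = 1e-4     # target probability for a software-dependent condition

# Hypothetical tree: the specification is violated when the plant is in the
# critical state AND (an assumption is violated OR the condition is violated).
p_violation = p_and([
    p_plant_state,
    p_or([p_assumption_violated, p_condition_target]),
])
print(f"estimated probability of violation: {p_violation:.4e}")
```

Where no estimate exists for a software-dependent condition, a target probability inherited from the higher level specification can take its place, as done here for `p_condition_target`.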

After  conducting  the  quantitative  risk  analysis,  the  last  stage  of  the  quality  (risk) 
analysis  is  to  perform  the  risk  assessment  of  the  safety  specifications.  This  is  a 
judgement  based  on  the  estimated  risk  which  provides  guidance  for  high  level 
decisions,  usually  associated  with  the  process  of  requirements  analysis.  The 
results  obtained  from  the  quantitative  analysis  should  be  considered  as  a  relative 
measurement  of  how  effective  a  given  strategy  is  in  reducing  the  risk  of  a  hazard, 
compared  to  the  results  obtained  for  alternative  strategies.  Hence  it  is  most 
useful  in  determining  which  strategy  or  combination  of  strategies  is  most  suitable 
for  the  risks  involved  (the  choice  of  a  strategy  might  also  be  influenced  by 
constraints  imposed  by  the  implementation,  e.g.  availability  of  sensors  and 
actuators  with  the  required  properties).  Also,  if  the  utilization  of  more  than  one 
strategy  is  required,  this  preliminary  risk  analysis  facilitates  the  search  for  a 
suitable  combination  of  the  available  strategies  in  order  to  avoid  common  mode 
failures. 

3.3 Mission and Safety Analysis

The primary aim of the quality analysis presented in this paper is to reduce the
system risk. However, it is usually impossible to maintain a complete dichotomy
between the mission and safety aspects, and it would be futile to impose safety
requirements which were so stringent that the system could not satisfy its mission.
To complement the risk analysis, the impact of the safety specifications on the
mission  of  the  system  must  also  be  considered.  Such  an  analysis  involves  relating 
the  different  safety  specifications  to  the  mission  requirements  that  can  be 
affected  by  them.  If  analysis  of  the  mission  requirements  follows  the  framework 
described  in  section  2.1,  leading  to  the  construction  of  a  Mission  Specification 
Graph  (MSG),  a  comparison  between  the  safety  specifications  and  mission 
specifications  (requirements  specifications  for  the  mission)  is  made  possible. 
During  the  development  of  a  robust  safety  specification  it  would  be  possible  to 
identify  the  mission  specifications  that  may  be  influenced  by  inspecting  the 
variables  that  are  restricted  by  the  safety  specifications  and  relating  these  to  the 
mission  specifications  at  the  same  level  of  abstraction.  Once  the  relevant  mission 
specifications  have  been  identified  an  informal  analysis  of  the  restrictions  that 
the  safety  specification  imposes  on  the  mission  can  be  conducted.  An  example  of 
such  an  analysis  is  presented  elsewhere  [13]. 

4  Conclusions 

This paper describes a systematic approach for the quality analysis of the
requirements specifications, in the context of a methodology for the
requirements analysis of safety-critical systems. The approach is based on an
analysis, at each level of abstraction, of the risks introduced by the various
decisions (that are based on assumptions) made in establishing the requirements.
The quality analysis follows the structure of a traditional safety study and
incorporates both qualitative and quantitative techniques.

The  results  of  the  risk  analysis  provide  estimates  of  the  risk  associated  with  a 
specification  and  predictions  of  the  software’s  contribution  to  the  system  risk. 
These  results  are  used  to  guide  the  construction  of  robust  requirements 
specifications,  increase  the  confidence  (assurance)  that  the  level  of  risk  is 
acceptable  and  provide  the  basis  for  a  feasibility  study.  The  approach  to  risk 
analysis  brings  the  safety  studies  of  the  system  and  software  closer  together  and 
delineates  the  contribution  of  the  software  to  the  overall  system  risk.  Some 
aspects of the approach have been applied to a train set example [14].

Abstract

Experimental studies dealing with the analysis of data collected on families of
products are seldom reported. In this paper, we analyse the failure data of two
successive products of a software switching system during validation and
operation. A comparative analysis is done with respect to: i) the modifications
performed on system components, ii) the distribution of failures and corrected
faults in the components and the functions fulfilled by the system, and iii) the
evolution of the failure intensity functions.

1  Introduction 

Most current approaches to software reliability evaluation are based on data
collected on a single generation of products. However, many applications, not to say
the great majority, result from evolutions of existing software: there are families of
products, the various generations resulting from evolutions for implementing new
functionalities. A new approach that is aimed at the incorporation of past experience
in predicting the reliability of a new, but similar, software has recently been
proposed in [1]. This approach requires the identification of parameters which
characterize past experience to be incorporated in the evaluation of the software
reliability. Clearly, the identification of these parameters will be based on the
analysis of data collected over the whole family of products. Experimental studies
dealing with the analysis of families of products are seldom reported [2, 3]. The data
considered in this paper were collected on the software of two successive
generations of the Brazilian Electronic Switching System (ESS), TROPICO.
Throughout this paper, the two products will be identified as PRA and PRB. PRA
was first developed and allows connection of 1500 subscribers. The processing
capacity of the TROPICO system was subsequently increased with the release of
PRB which allows the processing of up to 4096 calls. Many PRA software
components have been reused for the development of PRB and additional
components were developed.

The failure data collected on each one of these products have been considered
respectively in [4] for PRA and [5] for PRB. While our previous work was mainly
devoted to reliability analysis and evaluation, this paper is concerned with the
qualitative as well as quantitative analysis of the failure data. Our objective is to do
a comparative analysis of the two successive products based on the data collected
during the end of validation and the beginning of operation. Emphasis will be put on
the evolution of the software and the corresponding failures and corrected faults.
This paper is composed of five sections. Section 2 gives a general overview of the
TROPICO switching system. It describes the main functions performed by the
system and presents some statistics about the evolution of PRB with respect to PRA.
Section 3 describes the test environment and the failure data collected. Section 4
presents some of the results derived from the collected data. Finally, Section 5
outlines the main results obtained from the analysis of both products.

2 Software Description

2.1  General  description 

The TROPICO ESS software features a modular and distributed structure monitored
by microprocessors. The software can be decomposed into two main parts, that is,
the applicative software and the executive software.

Two categories of components can be distinguished in the TROPICO ESS software:

i) elementary implementation blocks (EIB), which fulfil elementary functions and

ii) groups of elementary implementation blocks according to the main four
functions of the system. These groups are:

• Telephony (TEL): call processing, charge metering, etc.

• Defense (DEF): on-line testing, traffic measurement, error detection, etc.

• Interface (INT): communication with local devices (memories, terminals), ...

• Management (MAN): trunk and subscriber signalling tasks, communication with
external devices, ...

2.2 Evolution of PRB with respect to PRA

The development of PRB started while PRA was under validation. Many PRA
components have been reused for the development of PRB and additional ones were
developed. Three types of EIBs can be distinguished:

• new: specifically developed for PRB;

• modified: developed for PRA and modified to meet the requirements of PRB;

• unchanged: corresponding to PRA EIBs included in PRB without modification.

Figure 1 gives the number of EIBs and the size of the software for PRA and PRB.
The software of PRA and PRB was coded in Assembly language. A 10 percent
increase of the PRB size can be noticed relative to PRA. Only 4 new EIBs were
developed for PRB. All the EIBs of PRA have been reused with or without
modifications for PRB.


Figure  1:  Number  of  EIBs  and  size  of  PRA  and  PRB 

Figure 2 shows the amount of modification performed on PRB with respect to the
number of EIBs and to the size of the software. About 67% of PRB code results
from the modification of the PRA code. About 75% of the modified EIBs belong to
the applicative software and 84% of unchanged EIBs to the executive. Thus, the
increase of the TROPICO capacity mainly led to major modifications of the
applicative software with only minor modifications of the executive.

When considering the four functions and the distribution of the three types of EIB
of PRB, we notice that most of the unchanged modules belong to INT (about 60%).


a) according to the number of EIBs    b) according to the size of EIBs

Figure  2:  Distribution  of  unchanged,  modified  and  new  EIBs  in  PRB 


3 Test Environment and Failure Data

3.1 Test Program

The software test program for TROPICO consists of four steps: 1) unit test,
2) integration test, 3) validation test, and 4) field trial test. The first three steps
correspond to the test phases usually defined for the software life cycle. Field trial
consists of testing a prototype in a real environment which is similar to the
operational environment. It uses a system configuration (hardware and software) that has
reached an acceptable level of quality after completing the laboratory tests.

The description of the whole quality control program for TROPICO is given in
[6, 7]. The test program carried out during validation and field trial test is
decomposed into four kinds of test (functional, quality, performance and overload
tests). PRA and PRB validation were carried out according to this program. Figure 3
shows, for the period of data collection on PRA and PRB, the length in months of
the validation, field trial and operation phases. As can be seen, no field trial tests were
performed for PRB. This is because many PRA components were reused for the
development of PRB, and PRB was put in operation while PRA had already been
operating for several months.

Figure 3: Validation, field trial and operation length for the period of data collection (months)

During the operational phase, the number of PRAs and PRBs installed on
operational sites was progressively increased (see Figure 4). At the end of the data
collection period, up to 15 PRAs and 42 PRBs had been installed.


Figure  4:  Number  of  installed  sites  versus  time 


3.2  Data  Collection 

Failure data affecting the TROPICO ESS are handled through an appropriate
failure report (FR) sheet containing the following:

• date of failure occurrence;

• origin of failure: description of the system configuration in which the failure was
observed and of the conditions of failure occurrence;

• type of FR: hardware, software, documentation, with indication of the affected
elementary implementation blocks;

• analysis: identification and classification of the fault(s) which led to failure
(coding, specification, interface, ...);

• solutions: the proposed solutions and those retained;

• modification control: control of the corrected elementary implementation blocks;

• regression testing: results of the tests applied to the corrected elementary
implementation block(s).

Only one FR is kept per observed failure: rediscoveries are not recorded. In other
words, if several FRs cover the same failure, only one (the first) is entered into the
database. In fact, an FR is both a failure report and a correction report since it also
contains information on the fault(s) that resulted in failure.

The results presented in the following sections are based on the analysis of the data
collected on the observed failures and on the corrections performed.

4  Relationships  derived  from  the  data 

This section presents and discusses some of the results obtained from the data.

4.1 Statistics on failures and corrected faults in PRA and PRB

Figure 5 gives the number of failures and corrected faults in PRA and PRB. It can
be seen that fewer failures occurred in PRB even though: i) the period of data
collection for PRB is longer than that of PRA (see Figure 3) and ii) a greater
number of systems have been in use during the operation phase (see Figure 4).
Furthermore, the number of corrected faults exceeds the number of failures. This is
due to the fact that some failures led to the modification of more than one EIB.


Figure 5: Number of failures (FR) and corrected faults in PRA and PRB

Figure 6 shows the statistics concerning the number of EIBs that have been
corrected because of a software failure. As can be seen, the results for PRA and
PRB are similar. For both products about 80% of the failures led to the correction of
only one EIB. This speaks in favor of software modularity and equally shows that
there is little failure interdependence among EIBs.
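
The statistics of this section can be derived directly from the failure reports. The sketch below is only an illustration: it assumes each FR is reduced to the set of EIBs it corrected, and the EIB names and sample reports are hypothetical, not data from the TROPICO database.

```python
from collections import Counter


def correction_statistics(reports):
    """Summarize failure reports, each given as the set of EIBs it corrected.

    Returns (number of failures, number of corrected faults,
    share of failures that led to the correction of exactly one EIB).
    """
    if not reports:
        raise ValueError("at least one failure report is required")
    n_failures = len(reports)
    # One corrected fault is counted per EIB modified by a failure report.
    n_corrected_faults = sum(len(eibs) for eibs in reports)
    eibs_per_failure = Counter(len(eibs) for eibs in reports)
    share_single = eibs_per_failure[1] / n_failures
    return n_failures, n_corrected_faults, share_single


# Hypothetical failure reports, for illustration only.
sample_reports = [
    {"EIB-A"},
    {"EIB-B"},
    {"EIB-A", "EIB-C"},
    {"EIB-D"},
    {"EIB-B"},
]
n_failures, n_faults, share_single = correction_statistics(sample_reports)
print(n_failures, n_faults, share_single)
```

In this toy data set the number of corrected faults exceeds the number of failures because one report modified two EIBs, which is the same effect noted for Figure 5.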

The analysis of the data corresponding to failures involving more than one
component allowed us to identify two pairs of EIBs that are strongly dependent with
respect to failure occurrence. For these two pairs, we noticed that the probability of
simultaneous modification of both EIBs, given that a failure was due to a fault
located in one of them, exceeds 0.5. This result was obtained for both PRA and
PRB. This type of analysis can be of great help for software maintenance. It
allows software debuggers to identify the stochastically dependent components and
to take this information into account when looking for the origin of failures.

Figure 6: Statistics on the number of EIBs affected by a failure

4.2 Distribution of failures and corrected faults per function

Figure 7 gives the number of failures and corrected faults attributed to the four
functions: TEL, DEF, INT and MAN (as defined in Section 2.1). The sum of failure
reports attributed to the functions is higher than the total number of failure reports
indicated in Figure 5: this is because when a failure is due to the activation of faults
in different functions, an FR is attributed to each of them.


Figure  7:  Failure  reports  and  corrected  faults  in  TEL,  DEF,  INT  and  MAN 

When  looking  at  the  distribution  of  corrected  faults  per  functions  (Figure  8),  we 
obtain  similar  figures  for  both  products,  in  particular  DEF  and  INT.  It  can  be  seen 
that most of the corrections were performed in TEL and INT. This can be explained
by  the  fact  that  these  functions  are  more  activated  than  DEF  and  MAN. 


Figure 8: Fault distribution in TEL, DEF, INT and MAN

Furthermore, most of the failures reported led to the modification of only one
function (90 %). Among the 465 FRs (resp. 210 FRs) recorded for PRA (resp.
PRB), only 54 FRs: 31 during validation, 10 during field tests and 13 during
operation (resp. 21 FRs: 10 during validation and 11 during operation) led to the
modification of more than one function. This shows that the functions are not totally
independent with respect to failure occurrence, although only a weak dependence
was observed. Note that this result, compared to those reported in Section 4.1,
shows that less dependence is observed between functions than between EIBs.
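
The 90 % figure can be checked from the counts quoted in this section. The short Python sketch below redoes that arithmetic; only the numbers come from the text, while the variable and function names are illustrative.

```python
# Counts quoted in the text: total FRs and FRs that modified more than one function.
total_frs = {"PRA": 465, "PRB": 210}
multi_function_frs = {
    "PRA": {"validation": 31, "field tests": 10, "operation": 13},
    "PRB": {"validation": 10, "operation": 11},
}


def single_function_share(total, multi_by_phase):
    """Share of FRs that led to the modification of only one function."""
    n_multi = sum(multi_by_phase.values())
    return 1.0 - n_multi / total


shares = {
    product: single_function_share(total_frs[product], multi_function_frs[product])
    for product in total_frs
}
for product, share in shares.items():
    print(f"{product}: {share:.1%} of FRs modified a single function")
```

The per-phase counts add up to the 54 and 21 FRs stated above, and both shares come out close to 90 %.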

4.3 Distribution of PRB faults per EIB type

Figure 9 shows the distribution of corrected faults in PRB when considering the
unchanged, modified and new EIBs. More than 80 percent of corrected faults
were attributed to modified EIBs. It is noteworthy that almost the same distribution
was obtained when considering data from validation or from operation only.

Figure 9: Distribution of FR per type


When reviewing the mean values of fault density (number of faults per Kbyte) in
the three types of EIBs, we obtain the following figures: 0.95 for modified EIBs,
0.75 for new ones and 0.49 for unchanged ones. One may think that the modified
EIBs are more error prone than the new and unchanged ones, and would conclude
that it is better to create new components than to modify already existing ones.
However, we should be careful when analysing this type of result. In fact, as only 4
new EIBs were developed for PRB, no significant conclusions with respect to this
particular point could be derived from this analysis.
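
For concreteness, the mean fault density per EIB type can be computed as sketched below. The sketch assumes that the mean is taken over the per-EIB densities (faults divided by size in Kbytes); the record layout and the sample EIBs are hypothetical and do not reproduce the TROPICO data.

```python
from collections import defaultdict
from statistics import mean


def mean_fault_density(eibs):
    """Mean of per-EIB fault densities (faults per Kbyte), grouped by EIB type."""
    densities = defaultdict(list)
    for eib in eibs:
        densities[eib["type"]].append(eib["faults"] / eib["size_kbytes"])
    return {eib_type: mean(values) for eib_type, values in densities.items()}


# Hypothetical EIB records, for illustration only.
sample_eibs = [
    {"name": "EIB-1", "type": "modified", "faults": 6, "size_kbytes": 8.0},
    {"name": "EIB-2", "type": "modified", "faults": 5, "size_kbytes": 4.0},
    {"name": "EIB-3", "type": "unchanged", "faults": 1, "size_kbytes": 4.0},
    {"name": "EIB-4", "type": "new", "faults": 3, "size_kbytes": 6.0},
]
densities = mean_fault_density(sample_eibs)
print(densities)
```

With only one EIB in a group, as for the "new" type here, the mean rests on a single observation, which mirrors the caution expressed above about the 4 new EIBs.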

An analysis of the average values of fault density presented in Figure 10 shows a
significant decrease of the fault density of PRB EIBs when compared to PRA. Also,
it can be seen that the fault density of all unchanged and modified PRB EIBs
significantly decreased when compared to the values computed for PRA. This
indicates an enhancement of the quality of the software. The experience accumulated
during the validation and operational use of PRA led to a better understanding of
the system and contributed to the improvement of the quality of PRB code.


a)  unchanged  EIBs  b)  modified  EIBs 


Figure 10: Fault densities of PRB EIBs compared to PRA

4.4 EIB size and fault density in PRA and PRB

Scatter plots of fault density per EIB (number of faults per Kbyte) versus the size of
the EIB were plotted for PRA and PRB. It was difficult to ascertain any trend within
these plots. Our objective was to analyse a possible significant dependence between
the EIB fault density and their size.

Figure 11 gives for PRA and PRB the fault density average values for three
categories of EIB size. The fault density is almost constant: it is around 2 faults per
Kbyte for PRA and 1 fault per Kbyte for PRB. This illustrates the improvement of
the quality of PRB code with respect to PRA and thus confirms the results reported
in Section 4.3. As the size of PRA and PRB EIBs is measured in Kbytes and not in
kilo lines of code, it is difficult to compare these values to other fault density values
such as those reported for instance in [8, 9].

Figure 11: Average values of PRA & PRB fault density versus EIB size


Figure 12 shows that the PRB modified EIBs exhibit higher fault densities on
average than unchanged EIBs. However, it should be noticed that the number of
EIBs in each category of size is small. Also, it can be seen that most of the unchanged
EIBs have a small size (less than 5 Kbytes) compared to modified EIBs.

* Note that the fault density as defined here is different from the commonly used one (i.e.,
number of faults per kilo lines of code); the latter is not available for this application.


Figure 12: Fault density average values of modified and unchanged PRB EIBs versus size
(size categories: EIB size < 5 Kb, 5 Kb < EIB size < 10 Kb, 10 Kb < EIB size < 15 Kb, EIB size > 15 Kb)

4.5 Evolution of failure occurrences with respect to time

Figure 13 shows the evolution of the failure intensities of PRA and PRB during the
period of data collection: for both products, even though the failure intensity is
globally decreasing during the operational phase, the trend is not monotone. The
local variations observed are due to the progressive installation of new systems (see
Figure 4). It is noteworthy that the impact of the number of operational systems on
the evolution of the failure intensity has been reported in several papers, see for
instance [10, 11].

In order to evaluate the reliability of PRA and PRB as usually perceived by the
users, we need to consider the failure intensities corresponding to an average system
(i.e., the failure intensity divided by the number of systems in use). Figure 14 shows
the evolution of the failure intensities of PRA and PRB for an average system. It can
be seen that the failure intensities of both products decreased globally during
operation, thus exhibiting reliability growth.
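
The normalization to an average system is a simple division per observation period. The following sketch illustrates it under the assumption that failures and installed systems are counted over the same periods; the monthly figures are hypothetical and are not the TROPICO data.

```python
def average_system_intensity(failures_per_period, systems_in_use):
    """Divide the observed failure intensity by the number of systems in use."""
    if len(failures_per_period) != len(systems_in_use):
        raise ValueError("both series must cover the same periods")
    if any(n <= 0 for n in systems_in_use):
        raise ValueError("each period needs at least one system in use")
    return [f / n for f, n in zip(failures_per_period, systems_in_use)]


# Hypothetical monthly observations, for illustration only.
failures = [6, 8, 9, 6]   # failures observed over all installed systems
systems = [2, 4, 6, 6]    # systems in use during the same months
per_system = average_system_intensity(failures, systems)
print(per_system)
```

In this toy series the raw failure count first rises as new systems are installed, while the intensity for an average system decreases steadily, which is the effect the normalization is meant to expose.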


In order to compare the reliability of PRA and PRB, we plot in the same figure
(Figure 15) the failure intensities observed for an average system during operation.
Unexpectedly, the reliability of PRB is worse than that of PRA. The same holds for
the groups of functions TEL, DEF and INT (Figure 16). This is surprising because,
as PRB has been developed from PRA which has been validated and extensively
used, one would anticipate that its reliability would be better than that of PRA. This
may be explained by the fact that major modifications had been performed on PRA
in order to adapt the system to the new specifications and no field trial test had been
performed before the introduction of the system in the field. Note that about 80 % of
PRB failures recorded during operation occurred during the first year of operation.


Figure 15: PRA and PRB failure intensities for an average system during operation*

Figure 16: Component failure intensities for an average system during operation*


* Note that for Figures 15 and 16, the X-axis indicates the number of months since the
system was put in operation.

Typically, a new system experiences a maturing period during which its reliability is
relatively low but afterward, reliability keeps improving and becomes better than
that of its predecessors. In fact, if we look at the long term evolution of the failure
intensity functions of PRA and PRB (see Figure 17) it can be seen that the residual
failure rate of PRB evaluated by the hyperexponential model [12] is less than the
residual failure rate evaluated for PRA. Similar results are also obtained for TEL,
DEF, and INT (Figure 18). For both products, the evaluations are based on the data
collected during the last year of operation.
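
To illustrate what a residual failure rate means, the sketch below evaluates the hazard rate of a two-phase hyperexponential distribution, which is the form usually associated with the hyperexponential model. This is an assumption of the sketch: the paper does not state the formula, and the parameter names (`omega`, `zeta_sup`, `zeta_inf`) and values are illustrative, not fitted to the TROPICO data.

```python
import math


def hyperexponential_intensity(t, omega, zeta_sup, zeta_inf):
    """Hazard rate of a mixture of two exponentials with rates zeta_sup >= zeta_inf.

    h(t) = (omega*zeta_sup*exp(-zeta_sup*t) + (1-omega)*zeta_inf*exp(-zeta_inf*t))
           / (omega*exp(-zeta_sup*t) + (1-omega)*exp(-zeta_inf*t))
    Numerator and denominator are divided by exp(-zeta_inf*t) so that the
    expression stays well defined for large t.
    """
    if not 0.0 < omega < 1.0:
        raise ValueError("omega must lie strictly between 0 and 1")
    if zeta_sup < zeta_inf or zeta_inf <= 0.0:
        raise ValueError("expected zeta_sup >= zeta_inf > 0")
    decay = math.exp(-(zeta_sup - zeta_inf) * t)
    numerator = omega * zeta_sup * decay + (1.0 - omega) * zeta_inf
    denominator = omega * decay + (1.0 - omega)
    return numerator / denominator


# Illustrative parameters (per hour); not estimates from the paper.
omega, zeta_sup, zeta_inf = 0.7, 1e-3, 1e-5
initial = hyperexponential_intensity(0.0, omega, zeta_sup, zeta_inf)
later = hyperexponential_intensity(1_000.0, omega, zeta_sup, zeta_inf)
residual = hyperexponential_intensity(1e7, omega, zeta_sup, zeta_inf)
print(initial, later, residual)
```

Under these assumptions the intensity decreases from `omega*zeta_sup + (1-omega)*zeta_inf` towards `zeta_inf`, and that limit plays the role of the residual failure rate.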

It is noteworthy that the same was noticed in [10] for successive releases of a
wide-distribution software product and in [2] for three successive products of a family
of ultra-available computers designed by AT&T Bell Laboratories.


Figure 17: Estimation of PRA & PRB failure intensities


        PRA             PRB
TEL     2.6 10^-5 /h    1.2 10^-? /h
DEF     4.3 10^-? /h    1.4 10^-5 /h
INT     4.2 10^-? /h    2.9 10^-5 /h
MAN     1.4 10^-? /h    8.5 10^-? /h

Figure 18: Residual failure rates evaluated by the hyperexponential model


5  Concluding  Remarks 

The data considered in this paper allowed us to analyse the evolution of the software
and the failures of two consecutive products of the TROPICO ESS. The main
results derived are as follows:

• A high percentage of failures was attributed to modified EIBs.

• For both products, about 80 % (resp. 90 %) of the failures led to the correction
of only one EIB (resp. function). Therefore, only a weak dependence with
respect to failure occurrence was observed between components.

• The fault density of PRA and PRB is almost constant with respect to size. It is
about 2 faults per Kbyte for PRA and 1 fault per Kbyte for PRB.

• The fault density values of all modified and unchanged PRB EIBs are lower than
those of PRA EIBs. This shows an improvement of the quality of PRB code with
respect to PRA.

• Comparison of the PRA and PRB failure intensities during operation shows that
PRB experienced a maturing period during which its reliability was relatively
low but afterwards, its reliability improved and became better than that of PRA.

The comparative analysis provides insight into the evolution of the software and the
reliability of two successive products of the TROPICO ESS. However, the results
obtained did not allow us to identify the various factors that influence the evolution
of the reliability of a family of products. In order to reach this objective, additional
information is needed concerning for instance: i) the development process and ii)
more than two successive generations of products. Furthermore, the collection and
analysis of several failure data sets relative to different families of products will be
of great help in this learning phase.

Abstract

This paper gives an overview of the requirements for safety critical software and
of the different methods for validating such software with CASE-tools. The
requirements and validation methods are discussed by means of a case study with
CASE-tools. A critical assessment of the use of CASE-tools is also given.

1  Introduction 

The department of the author has worked for many years in the field of system
validation of electronic equipment used in surface transport (e.g. the railway).
The complexity of control applications for many systems has nowadays grown so
much that computer systems are required and software is used in the system
elements. In safety critical system elements the implemented software has to be
"safe software". Efforts to achieve safety in software go in different
directions. Measures can be employed to avoid errors in the design process of
software or to neutralize safety critical effects of possible errors. To discover
errors in the software, validation methods are required. Using a case study of
software validation with CASE-tools, this paper discusses the safety requirements
for the software and software validation methods.

2  Safety  requirements  of  the  software 

Before speaking about the validation methods it should be clear what the
requirements for the safety relevant software are. These requirements will be briefly
discussed in this chapter. The requirements concern a large group of software
attributes. They could be divided from different points of view, e.g. in the following
way:

- Requirements for the software as a consequence of requirements for the
complete computer system.

- Requirements for the structure of the software.

- Requirements for the coding.

- Requirements for the protective measures against errors.

- Requirements for the documentation.

- Other requirements.

Short  exanq)les  for  the  requirements  will  be  given  in  the  following  chapter.  The 
enumerated  adjects  for  each  subgroiq)  of  requirements  are  only  exemplary. 

2.1  Requirements for the software as a consequence of requirements
for the complete computer system.

The requirements for the complete computer system have a great influence on
the safety concept of the software. The following requirements result from the
structure of the whole system: [1][2]

It should be clear whether the computer system has a redundant or a diverse
structure, and whether the system concept is a fault tolerance or a fault
avoidance concept.

The interfaces to other systems must be specified in a clear and simple
way.

The safety-relevant part of the software shall be separated from the
non-critical software.

The number of interrupts should be minimized to simplify the validation.

2.2  Requirements  for  the  structure  of  the  software. 

The structure of the software has a great influence on the complexity of the
validation process and on the possibility of errors. Therefore the following
requirements should be considered: [1][2]

The software shall consist of small modules in order to simplify the
validation.

The software elements shall only be sequences of statements, loops and
conditional clauses.

2.3  Requirements  for  coding. 

Coding is the translation of a logically structured program (structure
diagram, Nassi-Shneiderman diagram) into an executable programming language.
The errors that can occur at this stage are systematic errors.

To avoid errors at this stage, the following aspects shall be observed: [2]

The variables must be clearly separated into output, input and output/input
variables, and into global and local variables.

The addressing of variables and jumps must be clear. Complex calculation
of the conditions and addresses for branches and jumps shall be
avoided.

Jumps (branches) shall only go to the beginning of a loop.

After a loop or subroutine ends, execution must continue with the statement
that immediately follows the loop or the subroutine call.

Dynamic modification of instructions in the operational programs shall
not be allowed.

It must be ensured that the compilers used do not introduce new errors into
the code.

2.4  Requirements  for  the  protective  measures  against  errors. 

These measures can help to reduce the error rate and can be used to detect
errors at an early stage of the process:

Use of diverse software.

Use of redundant bits for coding.

Mutual comparison of checksums from parallel channels.

The software module shall run automatic tests at specified time intervals.
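
The mutual comparison of checksums from parallel channels can be illustrated
with a minimal Python sketch. It is not taken from the paper: the two channel
functions, the CRC-32 checksum and the reaction to a mismatch are assumptions
chosen only for illustration.

```python
import zlib


def channel_a(inputs):
    """First channel (hypothetical): computes 2*x + 1 for every input."""
    return [2 * x + 1 for x in inputs]


def channel_b(inputs):
    """Second, diversely written channel computing the same function."""
    return [x + x + 1 for x in inputs]


def checksum(values):
    """CRC-32 over a canonical ASCII encoding of the output values."""
    data = ",".join(str(v) for v in values).encode("ascii")
    return zlib.crc32(data)


def compare_channels(inputs):
    """Return the outputs if both channels agree, otherwise raise an error."""
    out_a = channel_a(inputs)
    out_b = channel_b(inputs)
    if checksum(out_a) != checksum(out_b):
        # Assumed reaction: a real system would enter its defined safe state.
        raise RuntimeError("checksum mismatch between channels")
    return out_a
```

Comparing checksums instead of full outputs keeps the data exchanged between
the channels small; the price is that the detection capability is limited by
the checksum that is chosen.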

2.5  Requirements  for  the  documentation. 

Fulfilling the requirements for the documentation is important for
understanding the program and supports its testability. The documentation
is also important for the reusability of software units. The most important
requirements for the documentation are the following: [3]

The minimum documentation for the code is the text of the equivalent
step in the structure diagram.

The documentation of each statement must be clear; redundant information
shall be avoided.

The use of each variable shall be described exactly: which procedures use
the variable and to which physical address it corresponds.

The structure diagrams for each software module shall be clearly arranged,
and the connection between the modules shall be obvious.

The documentation shall correspond to the newest version of the code.

2.6  Other  requirements 

Requirements which do not fit into the classification enumerated above are,
for example: [4]

The requirements imposed by the shutdown procedure of the system
and by the restart procedure.

The program segments shall have the same size if overlays are used. If
program segments require less memory, the unused memory shall be
filled with a defined bit pattern.

Critical bit patterns (e.g. all bits 0 or all bits 1) shall be avoided,
because defective hardware typically outputs these patterns.

3  Validation  methods 

This chapter gives a short overview of the validation methods and classifies
the different methods. Fulfilling the requirements mentioned above does not
sufficiently guarantee that there are no errors in the software. Therefore
each software product has to be validated.
The validation methods are largely independent of the field of application.
The methods can be subdivided into two main subgroups: "black box" validation
methods (also called functional validation) and "white box" validation methods
[5], which consist of qualitative and quantitative ones. Both subgroups include
static and dynamic validation methods. (The division of each subgroup into
static and dynamic methods has been discussed in [6].)

3.1  Black  box  testing 

Black box testing focuses on the reaction of the test object, which depends
on the different external parameters of the test object [7][8].

Function tests

Boundary value tests

With this method the boundaries and extremes of the input domains
are tested for agreement with the specifications.
The value zero (directly as well as in indirect translation)
shall be included in these tests.

Probabilistic tests

With this method the distribution of the input data shall be
simulated. Data outside the specified domains shall also
be included in this test.

Input output requirements tests

The fulfillment of the input output requirements is proved by
comparing the output data with the specified data.

Interface tests

With this test method, errors in subprograms and errors that can
lead to failures in particular applications shall be found.
This is realized by boundary value tests and probabilistic tests.

Performance tests

This test examines the limits of the system performance.
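
The boundary value tests listed above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the integer input
domain, the unit under test `clamp_speed` and the convention that
out-of-domain inputs raise `ValueError` are hypothetical and do not come from
the paper.

```python
def boundary_values(low, high):
    """Candidate test inputs around the limits of the integer domain [low, high]."""
    candidates = {low - 1, low, low + 1, high - 1, high, high + 1}
    if low <= 0 <= high:
        candidates.add(0)  # the value zero shall be included in the tests
    return sorted(candidates)


def clamp_speed(value):
    """Hypothetical unit under test with the specified input domain 0..120."""
    if not 0 <= value <= 120:
        raise ValueError("input outside the specified domain")
    return value


def run_boundary_tests(func, low, high):
    """Return the boundary inputs for which func deviates from the specification."""
    failures = []
    for value in boundary_values(low, high):
        try:
            func(value)
            accepted = True
        except ValueError:
            accepted = False
        # In-domain values must be accepted, out-of-domain values rejected.
        if accepted != (low <= value <= high):
            failures.append(value)
    return failures
```

An empty list returned by `run_boundary_tests(clamp_speed, 0, 120)` means that
the unit behaves as specified at all tested boundaries; it says nothing about
values in the interior of the domain.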

3.2  White  box  testing 

With the white box methods the tests focus on the structure and the internal
parameters of the test object [7][8][9].

Analytical methods

Semantic analysis

This analysis delivers the relationship between the output and
input variables.

Compliance analysis

It helps to find differences between the use of functions, variables
and procedures in the program and the specifications.

Structural analysis

The structural analysis is used to find jumps into a loop that
are not allowed, or unreachable statements.

Control flow analysis

This method is used to find inaccessible code segments
(unconditional jumps that leave statements unreachable).

Data flow analysis

The data flow analysis helps to find variables that are read
before they are written, variables that are written more than once
without being read, and variables that are written but never read.

Listing inspection

In the listing inspection the program is reviewed for inconsistency
with, and incomplete observance of, the development directions.

Walkthrough method

Here the test of the program focuses on finding contradictions
by carrying out the functions mentally in a group.

Syntax check

This method is used to find out whether the declarations of the
variables, types, functions and procedures are correct, and
whether the sequence of variables (input/output) is correct.

Time testing

With this validation method the worst-case combination of running
times is tested in order to find timing collisions.

Analysis of memory access

This method shall find out whether some software modules write to a memory
area that is already reserved for a variable used by other procedures. It
can also be used for a probabilistic analysis of the internal variables.

Specification test

This test proves the fulfillment of the specification.

Structural test

This test shall find out whether the structure of the software corresponds
to the structure of the specification.


4  Case  study  with  CASE-tools 

4.1  Preparation  phase 

The CASE-tools used cannot interpret an assembler language. They use a
special language. The source code has to be translated into the CASE-tool
specific language. The translation process can be simplified by building a
model of the processor (written in the CASE-tool specific language) that
simulates the instructions used.

The input file needs some additional information, e.g. procedure
specifications, main program specifications, function specifications, derive
relationships and assert statements. The derive relationships simplify the
analysis at a procedure call. The assert statements can be used for refining
the analysis. This information has a great influence on the results of the
analysis.

4.2  Analysis  phase 

The output of the CASE-tool depends on the specification in the command line
(i.e. on which keywords were used). The CASE-tools that have been used cover
(see also 3.2 White box testing) control flow analysis, data use analysis,
information flow analysis, semantic analysis and compliance analysis. The
information flow analyser delivers as its result the dependency of each
variable on the variables and constants used. For each variable the dependency
on conditional nodes is stated. A list of possible errors and redundant
statements is also given.

In the compliance analysis the relations between input and output variables
are calculated and compared with the specifications at the beginning and the
end of the program block. These specifications have to be inserted by the
operator of the CASE-tool. The quality of the result of this analysis depends
on these specification statements. By implementing more conditions in the code
the result of the analysis can be simplified.

The control flow analysis is used to determine the structure of the code and
to find unreachable statements and multiple entries into loops. The control
flow analyser simplifies the graph. The degree of simplification (whether only
sequences of nodes are removed, or also self loops) can be controlled by the
keywords used.
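
How unreachable statements can be found in a control flow graph is sketched
below in Python. The sketch does not reproduce the CASE-tool used in the case
study; the graph representation (a dictionary mapping each node to its
successors) and the node names are assumptions made for illustration.

```python
def unreachable_nodes(graph, entry):
    """Return all nodes of the control flow graph that cannot be reached from entry.

    graph maps every node to the list of its successor nodes.
    """
    seen = set()
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return sorted(set(graph) - seen)


# Hypothetical example: the node "dead" has no predecessor.
example_graph = {
    "start": ["check"],
    "check": ["body", "end"],
    "body": ["check"],
    "dead": ["end"],
    "end": [],
}
```

For `example_graph` and the entry node `"start"`, the function reports only
the node `"dead"`, because no path from the entry leads to it.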

The data use analysis shows how often a variable is read before it is written.
Variables which are written more than once without an intervening read could
indicate omitted code. The data use analysis also shows whether variables have
been written and never read, which could indicate redundant code. Possible
errors are also reported as a result; they have to be confirmed by the user of
the CASE-tool.
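
The three anomalies reported by a data use analysis can be sketched in Python
for a single, straight-line execution path. This is a simplified illustration
and not the algorithm of the CASE-tool: the event format and the wording of
the messages are assumptions, and a real analyser works on all paths of the
control flow graph.

```python
def data_use_anomalies(events):
    """Report data use anomalies for one execution path.

    events is a list of ("read" | "write", variable) pairs in execution order.
    """
    state = {}  # variable -> kind of the most recent access
    anomalies = []
    for op, var in events:
        if op == "read":
            if var not in state:
                anomalies.append(("read before written", var))
            state[var] = "read"
        elif op == "write":
            if state.get(var) == "write":
                anomalies.append(("written twice without read", var))
            state[var] = "write"
        else:
            raise ValueError("unknown operation: " + op)
    # A variable whose last access was a write is never read afterwards.
    for var, last in state.items():
        if last == "write":
            anomalies.append(("written but never read", var))
    return anomalies
```

As the text notes, each reported anomaly is only a possible error: whether it
points to omitted or redundant code has to be decided by the user.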

The semantic analysis generates the relation between input and output variables
of each executable path. The user of the CASE-tool has to compare the results of
the semantic analysis with the requirements specification.

4.3  Evaluation  phase 

This is the most difficult part of the validation with CASE-tools. Here the
results of the different analysis methods have to be compared and conclusions
must be drawn.

Some problems of the evaluation are described in this section. For example,
the communication between the individual subroutines of the tested software is
in many cases realized by using the accumulator and the flag register. In this
case the CASE-tools can deliver an error statement that the register is not
defined. This handover mechanism was not used occasionally but systematically,
and it violates the software requirements. The main question is whether this
violation is acceptable or has to be treated as a safety-critical violation.
The decision whether the use of a register as a variable is acceptable or not
has to be made by the validating person and the party ordering the validation
(contractor).

For one procedure the data use analyser indicated that a variable was written
several times with no intervening read. A review of the procedure showed that
these write operations were correct. The processor model used simulates the
flag register by using a boolean variable for each flag, and the flags have to
be set according to the operation of the processor. To avoid such results, the
variables to be checked can be selected in the CASE-tool. This option
simplifies the analysis results. It should be taken into account that the
selection of the variables is a critical decision made by the user of the
CASE-tool.

The software that has been validated by the author's department has as
documentation of the code only a structure diagram (Nassi-Shneiderman diagram)
and an incomplete variable list. It is clear that the requirements concerning
the documentation of the software were not fulfilled. The question is whether
the software shall be treated as safe software or as unsafe software. The
incomplete documentation increases the time necessary for the validation. The
presentation of the validation report has been discussed by G. List [10].

4.4  Advantages  for  the  use  of  CASE-tools 

The use of CASE-tools for software validation supports a formalization of the
analysis results. This formalization makes the analysis of the code easier. To
benefit from the simplified analysis, the CASE-tool user has to invest some
time in the preparation of the code before using the CASE-tool (see also 4.1
Preparation phase). The time profit $T_{pr}$ gained by using CASE-tools can be
described mathematically as the time profit margin

$$T_{pr}(t, t_{case}) = t - t_{case}$$

where $t$ is the time required for validation without CASE-tools and
$t_{case}$ the time required for validation with CASE-tools. The time profit
depends mainly on the efficient use of the CASE-tools.

The use of CASE-tools not only has an influence on the validation time, it also
influences the rest error ratio of the validation. The rest error ratio of the
validation is reduced by using CASE-tools. This quality improvement depends
mainly on the person who carries out the preparation of the validation object
and the assessment of the analysis results. The rest error ratio $r_r$ can be
described quantitatively as

$$r_r = \frac{N_u}{N}$$

where $N$ is the number of all items and $N_u$ is the number of all undetected
errors. The quantitative view of the error ratio has been discussed in more
detail by A. Sethy [11][12]. The quality improvement can be described as the
ratio of the rest error ratio without CASE-tools to the rest error ratio with
CASE-tools

$$V(r_r, r_{r,case}) = \frac{r_r}{r_{r,case}}$$

also called the improvement factor $V$ [13]. The time profit and the quality
improvement can also be seen from an economic point of view. CASE-tools
represent a major investment for firms. Therefore an economic justification for
such an investment is required. The CASE-tools can be used for, say, 5 years.
In this time interval the user has to carry out $M$ validations.

The costs per validation $C_{val}$ are then

$$C_{val} = \frac{INV}{M}$$

where $INV$ is the investment for the CASE-tool (including the costs for
training and the price of the CASE-tool). These costs per validation can be
converted into a time equivalent $T_{val}$ as follows:

$$T_{val} = \frac{C_{val}}{C_{man}}$$

where $C_{man}$ is the cost of manpower per hour.

The investment in the CASE-tool is justified if the following condition is
fulfilled:

$$T_{pr}(t, t_{case}) > T_{val}$$

In words: the time profit has to be greater than the time equivalent
$T_{val}$. The quality improvement of the rest error ratio can be quantified
economically by using the mean costs of error consequences. The quantification
of these costs depends on the user of the validation object, and they can vary
over a wide range.

Discussing this problem would go far beyond the scope of this paper and is
therefore not done here.
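
The quantities defined in this section can be combined in a short Python
sketch. The function names are chosen here for illustration, and the numbers
in the usage note are invented; they are not results from the case study.

```python
def time_profit(t, t_case):
    """T_pr: validation time saved per validation by using the CASE-tool (hours)."""
    return t - t_case


def improvement_factor(rr, rr_case):
    """V: rest error ratio without CASE-tools divided by the ratio with them."""
    return rr / rr_case


def cost_per_validation(inv, m):
    """C_val: investment INV spread over M validations."""
    return inv / m


def time_equivalent(c_val, c_man):
    """T_val: cost per validation expressed in manpower hours."""
    return c_val / c_man


def investment_justified(t, t_case, inv, m, c_man):
    """True if the time profit exceeds the time equivalent of the investment."""
    return time_profit(t, t_case) > time_equivalent(cost_per_validation(inv, m), c_man)
```

With the invented values t = 400 h, t_case = 250 h, INV = 100000, M = 20 and
C_man = 50 per hour, the time profit is 150 h and the time equivalent is
100 h, so the condition is fulfilled. With only M = 10 validations the time
equivalent rises to 200 h and the investment is no longer justified by the
time profit alone.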

For software metrics some usable results can easily be obtained by the use of
CASE-tools. The properties of the analysis results are well specified, so that,
for example, the complexity of the software can be determined reproducibly.

5  Conclusion 

The use of CASE-tools makes the analysis of the code easier. As has been
shown in this article, the use of CASE-tools for software validation delivers
an improvement in both the time and the quality aspects. It should be
considered that, while the time for the analysis is reduced, the time for
assessing the results increases.

The quality improvement of the validation results also depends on the person
who validates the software (see also section 4.4).

As discussed above, the CASE-tools cover methods of the white box testing
group. For a complete validation of safety-critical software, some methods of
the black box testing group must also be carried out. The CASE-tools point to
possible errors in the code. These errors have to be confirmed by other
validation methods. The use of CASE-tools may replace some parts of the
conventional test methods. However, it must be clear that an understanding of
the code is still necessary.

Hopefully this paper has shown that the use of CASE-tools can simplify the
life of the validating person.

Abstract

The key element of dependable distributed systems is the
communication strategy. Communication between distributed
and/or redundant system components (processes) may use
standard network tools and protocols. The absence of
multicast support in the ISO network model above the Network
Layer requires special provisions for software that must distribute
data over a network to an unknown number of network partners.
A method is presented which combines the benefits of different
network layers to cover the needs of such a distributed,
dependable, and redundant system.


1  Introduction 

The Austrian Research Center Seibersdorf (ARCS) has specified, designed and
implemented the distributed security, alarm and control system called CSS
(Scaleable Security System) for Philips Industry [1]. The task of this system
is to protect an area, a plant or a building complex from threats from the
environment (therefore it is sometimes called a “Risk Management System”). The
properties of such a system depend on its ability to get information about the
environment and its inner status (by peripheral subsystems, sensors, etc.) and
on the trustworthiness of the CSS itself. The primary goals of the system are
high availability and scaleability (i.e. it is freely configurable within any
topology and network). One of the key ideas of the concept is that processes
and processors may be distributed freely according to the principles
enumerated above, and that the configuration is scaleable from a single
workstation to a redundant network of n processors. Dependability and fault
tolerance [2, 3] are implemented via distributed, redundant (software)
processes using a multilayer software structure for communication and message
exchange and functional software interfaces between the various external
subsystems. One topic is the communication strategy chosen to support any
redundant hardware and software structure, fail-over strategies, and dynamic
reconfiguration. The experiences with the implementation of the communication
mechanisms on different levels of the underlying network protocol and the
problems encountered during the implementation and installation phase under
hard real-time and load restrictions will be discussed.

2  System  Overview 

The design and development of the CSS system started in 1988, when Philips
Industry decided to plan a new “Alarm and Control System” which should take the
place of the hitherto existing PDP11-based system XLSS (Extended Local
Supervisor Station). The primary goal of such a system is to protect an area, a
plant or a building complex reliably from break-in, fire and other undesired
events. In addition, the system should be able to control parts of the
building, plant, or area, and, of course, reflect (and show) the current state
of the “outer world” as well as the “inner status” at any time.

The overall system is composed of the central CSS for processing, managing,
visualization and operating, and the peripheral subsystems, which are partially
autonomous sources of information (and control), provided by different vendors
and following different communication and control strategies. The overall
system dependability is limited by the dependability characteristics of the
peripheral subsystems; the goal of the design was that peripheral subsystems as
well as human operators, guards etc. can justifiably rely on the CSS services.

Several constraints concerning the environment and the target hardware
components were given by Philips, so it was not possible to choose specially
reliable computing elements. The most important restrictions were the
following:

R1  The usage of standard off-the-shelf Digital Equipment Corporation hardware
and software, especially VAX computers running the VAX/VMS operating
system.

R2  The usage of a standard Ethernet Local Area Network (LAN), including
standard network controllers and protocols (CSMA/CD).

R3  The usage of customer-defined and/or pre-installed subsystems (redundant or
non-redundant communication lines), which represent the interface to the real
world.

R4  The usage of standard software components wherever possible (e.g. a
standard database system and the Graphical Kernel System GKS [4]).

R5  A restricted development budget and a target date for completion.

Restriction R2 prevents the use of special (hardware) network attachment
controllers as described in [5], and it also raises the question of whether
Ethernet is adequate for real-time operation (e.g. [6]) and for dependability
[7].

Restriction R3 also means that some subsystems are not available for “off-line”
(lab) software tests; the software for these subsystems must be carefully
checked out in the “living” system.

All these limitations lead to the challenging task of designing and building a
cheap, re-usable, and dependable system consisting only of standard
components [8].

2.1  Building  Blocks 

The main strategy for the system was to split up the software and the hardware
into building blocks.

2.1.1  Software Building Blocks

The basic decision while designing the software was (i) to separate the
CSS software into an arbitrary number of processes (about 40 at the moment),
and (ii) to provide a single logical communication path for inter-process
communication. To avoid an uncontrolled information exchange between processes,
primary communication paths have been introduced. These primary communication
paths define groups of processes that may exchange information. The processes
are grouped into two classes, namely central processes and peripheral
processes. Central processes are typically the Central Coordinator or the
Database Access Module; peripheral processes are Human Interface processes or
the Subsystem Access processes [1]. Communication can only take place (i)
between a peripheral and a central process, and (ii) between central processes.
The main difference between central and peripheral processes is that central
processes must be available in the system, whereas peripheral processes may be
available in the system. So the peripheral processes can be seen as one of the
“scaleable parts” of a CSS. Fig. 1 shows a simple structure of CSS processes in
a single node and the primary communication paths.
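
The rule for primary communication paths can be expressed as a small Python
check. This is only a sketch of the rule as stated in the text; the constant
and function names are assumptions and do not stem from the CSS
implementation.

```python
CENTRAL = "central"
PERIPHERAL = "peripheral"


def may_communicate(class_a, class_b):
    """Apply the rule for primary communication paths.

    Allowed: (i) peripheral <-> central and (ii) central <-> central.
    Not allowed: peripheral <-> peripheral.
    """
    for process_class in (class_a, class_b):
        if process_class not in (CENTRAL, PERIPHERAL):
            raise ValueError("unknown process class: " + str(process_class))
    return CENTRAL in (class_a, class_b)
```

Under this rule a Human Interface process (peripheral) may exchange messages
with the Central Coordinator (central), but not directly with a Subsystem
Access process (peripheral).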


Figure 1  CSS Processes in a Single Node (recoverable diagram labels: Human
Interface I, Human Interface II, Protocol Printer)

This  structure  has  the  following  advantages: 

A1  The system may grow, as completely new processes (e.g. new subsystems or
new human interfaces) can be added very easily.

A2  The processes may reside all on one node, or may be distributed over several
network nodes, as the process communication can be seen as remote
procedure calls.

A3  Single (or all) processes may be replicated according to the needs of a
specific CSS installation.


2.1.2  Hardware Building Blocks


As mentioned above, one of the main characteristics of the system is its
scaleability. There are several ways in which this takes place. For the
hardware, two requirements had to be fulfilled: first, the requirement to build
an n-fold redundant system, and second, to provide interfaces (typically RS232)
to a conceptually unlimited number of subsystems. The first requirement can
simply be met by connecting n nodes to a network, where each node runs a CSS.
The second requirement is of more interest, because it is not possible to
attach an arbitrary number of interfaces to a computer, so Terminal Servers
were used (Fig. 2).


Figure 2  Terminal Server

In addition to the benefit of having an arbitrary number of communication
lines, Terminal Servers offer the advantage of being able to control a (server)
communication port from several nodes. This feature makes special
line-switching hardware obsolete. The switching of lines to different nodes can
be controlled by the software. Disadvantages are discussed in Sec. 5.1.

3  Process Communication

Process communication takes place by means of mailboxes. The principal idea of
mailbox communication between CSS processes is that each CSS process has
exactly one mailbox where it receives information (messages). The basic view of
two communicating CSS processes is shown in Fig. 3.


Figure 3  Basic Communication of Two CSS Processes

This kind of communication strategy ensures that each process has precisely one
input channel. Handling messages in this way can, however, lead to situations
in which a process is blocked. For instance, process A may be busy (i.e. not
ready to empty its own mailbox) while another process B fills up the mailbox of
process A. In this case, process B would be blocked because it cannot get rid
of its messages.


(Diagram labels belonging to Fig. 3: Process B, Mailbox, Process A)


3.1  Receive Queues

To avoid these situations, it must be guaranteed that a process is always able
to empty its mailbox. This is done by splitting a CSS process into two (or
more) parallel-working threads [9]. Thread 1 reads the mailbox asynchronously
and puts the message into an arbitrarily large FIFO queue, which is read in
thread 0. Thread 0 (the application program) does not read from the mailbox,
but from the receive queue (Fig. 4).

Figure 4  Receive Queue of One CSS Process

Even if thread 0 is blocked (e.g. if the application program is doing some
calculation), thread 1 is still working.
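
The receive queue mechanism can be sketched with Python's standard `queue` and
`threading` modules. This is an illustration of the idea only, not the VAX/VMS
implementation: the class name `CssProcess`, the mailbox size and the use of
`None` as a shutdown sentinel are assumptions.

```python
import queue
import threading


class CssProcess:
    """A process with a bounded mailbox and an unbounded receive queue."""

    def __init__(self, mailbox_size=2):
        self.mailbox = queue.Queue(maxsize=mailbox_size)  # bounded mailbox
        self.receive_queue = queue.Queue()  # arbitrarily large FIFO queue
        # "Thread 1": moves messages from the mailbox to the receive queue.
        self._reader = threading.Thread(target=self._drain_mailbox, daemon=True)
        self._reader.start()

    def _drain_mailbox(self):
        while True:
            message = self.mailbox.get()
            if message is None:  # sentinel used by close(); not a valid message
                break
            self.receive_queue.put(message)

    def send(self, message):
        """Called by other processes; blocks only while the mailbox is full."""
        self.mailbox.put(message)

    def next_message(self, timeout=1.0):
        """Called by "thread 0" (the application program)."""
        return self.receive_queue.get(timeout=timeout)

    def close(self):
        """Stop the reader thread after all pending messages were moved."""
        self.mailbox.put(None)
        self._reader.join()
```

Because the reader thread empties the mailbox independently of the
application, a sender can deliver more messages than the mailbox holds even
while the application is busy. The price is that the receive queue can grow
without bound if the application never catches up.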

A problem arises when the system must handle (soft) real-time events with this
kind of communication structure. A burst of non-real-time events followed by a
real-time event would cause the real-time event to be delayed. A solution for
this problem is to introduce several receive queues as shown in Fig. 5.

Figure 5  Multiple Receive Queues of One CSS Process

The events (messages) in the CSS are assigned priorities, and each
priority has a separate receive queue. The receive part of the CSS message
handling environment first scans the first receive queue (with priority 0,
which means “real-time” priority) and delivers the message to the application
program. Then all other queues are scanned. This method still requires the use
of bounded loops (e.g. [10]).
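
The scanning of several receive queues in priority order can be sketched in
Python as follows. The class name `PriorityReceiver`, the number of priority
levels and the message contents are assumptions made for illustration; the
sketch leaves out the threads that fill the queues.

```python
from collections import deque


class PriorityReceiver:
    """One FIFO receive queue per priority; priority 0 is the real-time queue."""

    def __init__(self, levels=3):
        self.queues = [deque() for _ in range(levels)]

    def put(self, priority, message):
        """Append a message to the receive queue of its priority."""
        self.queues[priority].append(message)

    def next_message(self):
        """Scan the queues starting at priority 0 and return the first message."""
        for receive_queue in self.queues:
            if receive_queue:
                return receive_queue.popleft()
        return None  # all queues are empty
```

If two low-priority messages are queued before a priority-0 message arrives,
`next_message` still returns the priority-0 message first, so a burst of
non-real-time events no longer delays a real-time event.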
3.2  Real-World Interface

The previously described mechanisms are also used to implement a non-polling
(event-driven) form of real-world communication. The communication layers are
extended by an additional thread. This thread may handle different protocols, e.g.
protocols for external devices, or the X-Protocol (Fig. 6). After completion of
protocol handling in thread 2, the final "packet" is processed by the system as any
other message.


Figure  6  External  Protocols 

As  mentioned  above,  this  stack  is  used  to  build  an  interface  to  foreign  protocols. 
These protocols are widely used in the CSS as a link to different subsystems, about
30 at the moment.

4  Network 

The  principal  communication  layers  of  a  CSS  process  are  shown  in  Fig.  6.  The 
sending  object  S  can  be  any  other  process  or  even  the  receiving  process  itself.  So  it 
is (or appears to be) very easy to expand the system for network usage by simply
adding a new sending object which performs a network operation. Network
capability may be introduced to such a system for the following two reasons:

N1  to  build  a  client-server  system 

N2  to  distribute  data  to  process  replicas 

of  which  only  N2  is  of  interest  here. 

The main difference between these two items is that N2 requires (parallel)
communication with a usually unknown number of (network) partners. Several
techniques have been introduced to perform these multicast communications
[11, 12]. But due to some restrictions given in Sec. 2, it was not possible to
implement a completely reliable group, or multicast, communication protocol.

4.1 ISO Network Layers

The  method  described  now  tries  to  combine  the  benefits  of  different  ISO  network 
layers  to  cover  the  needs  of  N2  and  to  minimize  the  expenditure  of  implementation. 
The  following  prerequisites  were  given: 

(i) An Ethernet Local Area Network that provides multicast and broadcast
functionality on the Data Link Layer.

(ii) An ISO network that provides reliable communication above the Network
Layer (with the Network Service Protocol NSP, "A protocol that provides
reliable message transmission over virtual circuits. Its functions include
establishing and destroying logical links, error control, flow control, and
segmentation and re-assembly of messages" [13]).

4.2 Simulated Multicasts

The  idea  now  was  to  use  these  two  network  features,  namely 

•  real multicasts on the Data Link Layer,

•  and  an  ordered  set  of  reliable  unicasts  on  the  Network  Layer 

together as simulated multicasts in the following manner. Each process
periodically transmits a unique "hello" multicast packet on the Data Link Layer
(unique means (i) a group-unique Ethernet Protocol ID, and (ii) a group-unique
multicast address, both administered by the CSS), which is responded to by the
instances of that process on other nodes. The response to this real multicast
packet (a unicast packet) is used to build a list of network partners. Both
activities are performed in different threads. Further network calls use this list
to transmit data with the NSP as simulated multicasts, i.e. a sequence of
unicasts, which are

(i)  strictly  sequential, 

(ii)  synchronous,  and 

(iii)  do  not  require  any  protocol  handling. 

If  a  process  does  not  respond  to  subsequent  "hello"  packets,  or  if  a  call  on  the 
Network  Layer  fails,  the  process  that  caused  that  failure  is  assumed  to  be  down  and 
will  be  removed  from  the  list. 

Since communication on the Data Link Layer is not reliable and packets may get
lost during heavy network load, the receiving partner has to wait a certain time
before it may assume that a non-responding process is down. The CSS
application process that uses the multicast mechanism has to call only one
MULTICAST routine, which then performs all necessary processing steps.
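
The partner list and the sequence of unicasts can be sketched as follows. This is an illustration under assumptions, not the CSS code: the class and function names, the timeout value, and the use of `ConnectionError` to signal a failed Network Layer call are all hypothetical, and the real transport is replaced by a callback.

```python
class PartnerList:
    """Partners discovered through "hello" responses."""

    def __init__(self, timeout):
        self.timeout = timeout      # grace period before a silent partner is dropped
        self.last_seen = {}         # node name -> time of its last hello response

    def hello_response(self, node, now):
        self.last_seen[node] = now

    def expire(self, now):
        """Remove partners that have not answered within the grace period."""
        silent = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in silent:
            del self.last_seen[node]

    def nodes(self):
        return sorted(self.last_seen)   # ordered set of unicast targets


def simulated_multicast(partners, payload, send_unicast):
    """Send payload to every partner, one synchronous unicast after another."""
    delivered = []
    for node in partners.nodes():
        try:
            send_unicast(node, payload)     # assumed reliable, blocking call
            delivered.append(node)
        except ConnectionError:
            del partners.last_seen[node]    # failed call: partner assumed down
    return delivered


partners = PartnerList(timeout=3.0)
for node in ("A", "B", "C"):
    partners.hello_response(node, now=0.0)
partners.hello_response("A", now=5.0)
partners.hello_response("B", now=5.0)
partners.expire(now=5.0)            # C has been silent for longer than 3 s

def fake_unicast(node, payload):
    """Stand-in transport in which node B is unreachable."""
    if node == "B":
        raise ConnectionError(node)

delivered = simulated_multicast(partners, b"state", fake_unicast)
```

Both removal paths from the text appear here: C is dropped because it stopped answering "hello" packets, and B is dropped because a unicast to it failed.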

An  overview  of  the  simulated  multicast  mechanisms  is  given  in  Fig.  7.  Note  that 
this figure does not show the relation to the CSS process communication as
described  in  Sec.  3,  Fig.  6. 

Voting  is  also  performed  on  this  level  within  segments  (a)  and  (b).  As  a  result  of 
this  voting,  an  appropriate  message  is  delivered  to  the  calling  process. 

Figure 7 Simulated Multicasts

It is clear that the overall transmission time T for a simulated multicast packet
increases with the number of nodes n involved, with

$$T = t_s + \sum_{i=1}^{n} (t_i + \tau_i)$$

where $t_s$ is the time consumed by the sender node, $t_i$ is the individual
error-free transmission (and receive) time to node i, and $\tau_i$ is an additional
delay caused by retries due to network errors. $t_s$ and the individual transmission
time $t_i$ depend on the CPU power only; $\tau_i$ depends on the network load and the
CPU power, with $\tau_i = 0 \;\forall i \in \{1 \ldots n\}$ for an error-free
transmission. Table 1 gives an overview of individual transmission rates (R
computed from T with n = 1).


                        CPU Power                  R
CPU Type              CSS     VUP   SPECmark    (Mbps)     x       σ
VAXstation 2000       1.0     0.9      -         0.06     144    2.608
MicroVAX 3600         2.5     3.2      -         0.34     144    0.110
VAXstation 3100/76   10.0     7.6      -         0.67     144    0.252
VAXstation 4000/90   25.0      -     32.8        1.93     144    0.032

Table 1 Network Transmission Rates of Application Data

Data are based on the communication between two computers of the same type
over an Ethernet Local Area Network (10 Mbps). The transmission time T of
user-level packets with a constant size of 128 octets of application data has been
converted into the transmission rate R given in Mbps. (Note that we have to
distinguish between a user-level information packet and a LAN-level information
packet!) The CPU power is given in different terms (CSS = CSS-specific computing
power relative to a VAXstation 2000; VUP = VAX Units of Processing ≈ MIPS), x is
the number of measurements, σ is the standard deviation.
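
The conversion between per-packet time and rate can be reproduced with a short sketch. It assumes that 1 Mbps means 10^6 bits per second and that only the 128 octets of application data are counted; the function names are illustrative.

```python
PAYLOAD_OCTETS = 128   # application data per user-level packet (from the text)

def rate_mbps(transmission_time_s, octets=PAYLOAD_OCTETS):
    """Application-data rate in Mbps for one packet sent in transmission_time_s."""
    return octets * 8 / transmission_time_s / 1e6

def time_ms(rate, octets=PAYLOAD_OCTETS):
    """Inverse conversion: per-packet time in milliseconds for a rate in Mbps."""
    return octets * 8 / (rate * 1e6) * 1e3

# R values taken from Table 1
for cpu, r in [("VAXstation 2000", 0.06), ("MicroVAX 3600", 0.34),
               ("VAXstation 3100/76", 0.67), ("VAXstation 4000/90", 1.93)]:
    print(f"{cpu:20s} {time_ms(r):6.2f} ms per 128-octet packet")
```

Under these assumptions a rate of 0.06 Mbps corresponds to roughly 17 ms per user-level packet, and 1.93 Mbps to roughly 0.5 ms.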

5  Experience  and  Field  Data 

The system is now installed at 6 sites with a total of 10 system-years of operation.
Experience and field data refer to these systems.

5.1  Terminal  Servers 

Terminal Servers as mentioned in Sec. 2.1.2 turned out to be the most problematic
components in the CSS. The problems encountered were

(i) questionable real-time characteristics

(ii) unmotivated "port stops"

(iii) different behavior of different types of Servers

of which we will discuss (i).

5.1.1  Timing  Problems 

Fig. 8 shows the typical I/O timing behavior of a Terminal Server with the following
setup: DECserver 200/MC (V3.1 BL37, LAT V5.1, ROM BL20), RS232,
1200 baud transmit/receive speed, one start-bit, one stop-bit, even parity,
23 characters message length, VAXstation 3100/76, VAX/VMS V5.5-2.

Figure 8 I/O Time on a Terminal Server (I/O time in s plotted against observation
time in h)

A "Schauer PDU/DCF77" clock was used as a data generator. Every five minutes
this clock transmits a 23 character packet containing the current date and time over
a serial line. This time information was compared with the current system time, as
the line clock was assumed to be accurate, and the system clock was assumed to be a
"good" clock [14]. The result of this measurement was that the transmission and
processing time of a packet was between 0.2 seconds and 0.3 seconds during normal
network load (2%) and low CPU load (<1%). Heavy network load (50%) increased
the transmission time up to 0.6 seconds, and electrical network failures caused
delays of over 2 seconds.

Using  a  direct  communication  link  (UART)  gives  a  constant  transmission  time  of 
0.2  seconds  (not  shown  in  Fig.  8). 
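
The 0.2 seconds measured on the direct link are close to the pure wire time of the message, as a rough calculation suggests. The number of data bits per character is not stated in the setup, so both 7 and 8 data bits are tried below as assumptions; the function name is illustrative.

```python
def wire_time_s(chars, baud, data_bits, parity=True, start_bits=1, stop_bits=1):
    """Time to clock `chars` characters over an asynchronous serial line."""
    bits_per_char = start_bits + data_bits + (1 if parity else 0) + stop_bits
    return chars * bits_per_char / baud

# 23 characters at 1200 baud, one start bit, one stop bit, even parity
t7 = wire_time_s(chars=23, baud=1200, data_bits=7)   # 10 bits per character
t8 = wire_time_s(chars=23, baud=1200, data_bits=8)   # 11 bits per character
print(f"7 data bits: {t7:.3f} s, 8 data bits: {t8:.3f} s")
```

Either assumption gives about 0.2 s, which suggests that on the direct link the serial transfer itself dominates, and that the additional delay seen with the Terminal Server comes from the network path.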

As the CSS was not designed to be a hard real-time system, it was possible to
solve these timing problems within the CSS software.


5.2 System Field Data

Table 2 gives an overview of the systems, nodes, disks, and Terminal Servers
currently installed and operating in the field.


                     Systems   Nodes   Disks   Servers
Number of elements      6        12      16       8
Operational years      10        22      30      14
Damages                n/a        0       2       1

Table 2 CSS Elements


To date, three accidents have occurred: two head crashes on disks, and one
Terminal Server breakdown. The head crashes have been tolerated by the CSS since
the VAX/VMS operating system automatically shut down and the CSS processes
running on those nodes have been recognized as being unavailable. The Terminal
Server breakdown has been tolerated for those subsystems with redundant
communication lines to different Terminal Servers.


                                       System               Node
Down reasons                       downs   UA (hours)       downs
1. Power fails (test, service)        0        0               7
2. CSS software failures              1       24               8
3. VMS software failures              0        0               5
4. Hardware failures                  0        0               2
5. Maintenance                       21        3.3            61
Σ                                    22       27.3            83
Availability                            99.95%

Table 3 CSS Downs

Table 3 shows the down-times of single nodes and the whole system. The
unavailability (UA) of the whole system had two reasons. The first was a fatal
software failure in a central CSS process that caused all nodes to be inoperable.
The failure was repaired within 24 hours. The second reason was (and is) the down
time due to system maintenance and software upgrades. All other node failures have
been tolerated by the CSS.


6  Conclusion 

A system overview of a dependable, scaleable distributed system has been given (for
a broader description see [1]). The design goals of scaleability, flexibility of
configuration, ergonomy of human interfaces, flexibility to integrate new peripheral
subsystems and the application of standards as far as possible have been reached by
"modularization through distribution", which includes the concept of hardware and
software building blocks, process replication and standard ISO network layers.
System  maintainability  and  the  possibility  of  easy  implementation  of  a  variety  of 
fault  tolerant  architectures  are  further  results  of  that  concept.  These  goals  cannot  be 
reached  by  a  single  processor/single  layer  approach,  although  there  are  some 
tradeoffs  with  respect  to  some  of  the  dependability  attributes  when  a  distributed 
solution  is  chosen. 

Dependable  communication  mechanisms  have  been  identified  as  the  key  issue  for 
providing  reliable  services  for  process  fail-over  strategies  and  dynamic 
reconfiguration. 

It has been shown how, on the basis of standard hardware and software, a
reasonably dependable system with reasonable real-time characteristics has been
implemented by adding some software that uses Ethernet multicasts to provide
valid configuration status information, and simulated multicasts within the ISO
stack framework. Some figures and field data as well as relevant implementation
details have been presented.

By the end of the year, the CSS system will be installed at ten sites, mainly
large banks and museums (including wide-area networks connecting several buildings
and branches).

1. Introduction

This paper addresses some issues involved in real-time detection of failures of
reactive systems. The system architecture considered is shown in Figure 1. External
behavior of the reactive system is monitored by a supervisor, which may execute on
a separate platform. The supervisor monitors the inputs and outputs of the system
and reports the failures that occur.

Real-time  detection  of  failures  has  a  number  of  benefits.  Consider  an  application 
such  as  telecom  switching: 

a) Early reporting of failures gives the operating company an opportunity to repair
the underlying fault before the users start filing complaints.

b) Certain kinds of failures, such as those due to loss of shared resource units, are
visible only to an entity with a global perspective. Early notification of such
failures makes it possible to take corrective steps before their accumulated effects
result in major service disruptions.

c) In many reactive systems, failures of control software do not have immediate
effect. Because of mechanical inertia etc., a long time interval may elapse before a
software failure has a detrimental impact on the controlled hardware. Real-time
detection of failures provides a basis for subsequent retraction of their effects.

Supervision-based  approaches  to  failure  detection  and  retraction  are  becoming  more 
important  as  systems  and  their  control  programs  are  constructed  from  off-the-shelf 
components. 

In  applications  in  which  the  external  behavior  of  the  reactive  system  is  specified 
formally, it is attractive to have the supervisor execute (or interpret) a model
derived from the system specification.

The paper considers the case when the external behavior of the target system is
specified by a model based on communicating, extended finite state machines
(specification processes). The formalism used is the CCITT Specification and
Description Language (SDL) [1]. SDL is an international standard used in the
telecommunication industry. The SDL specification of external behavior is
supplemented by the specification of response times. The focus of the paper is on
event-driven applications whose processing is relatively simple. A typical
application is telecom switching.

Supervision-based failure detection has some similarities to automated test
oracles [4]. However, automated oracles do not detect failures as they occur and
usually assume a particular resolution of specification nondeterminisms. Real-time
monitors (see, e.g. [5]) work in real time, but are typically closely coupled with
the program being monitored. Supervision-based failure detection also resembles
approaches such as the safety bag (SB) [6]. However, SB checks for violations of
safety regulations and aims to prevent failures from occurring in the first place.
Specialized techniques developed to detect certain kinds of telephone exchange
failures in real time can be found in articles describing their maintenance software
(see, e.g. [7]). The theory of beliefs, introduced in Section 3, was inspired by
truth maintenance systems and nonmonotonic reasoning [8].

This paper is organized as follows. Section 2 gives an overview of the CCITT SDL.
Section 3 discusses the two basic strategies for failure detection in real time
(input-driven and output-driven) and presents formulas that estimate their
processing and memory requirements. Section 4 describes experience with an
input-driven supervisor which was developed to automatically collect failure data
of a small exchange. Section 5 offers concluding remarks.


2.  Specification  Formalism  and  Issues 


Structurally, an SDL specification consists of a hierarchy of blocks. Blocks are
interconnected by channels. Channels carry signals between blocks. A leaf block
contains one or more SDL processes, whose behavior is specified by an extended
finite state machine. Specification processes may contain local variables, which
may be updated and tested. Processes within a block communicate by exchanging
signals over signalroutes. SDL semantics is defined operationally, by the Abstract
SDL Machine [1].

SDL is illustrated in Figure 2. This figure shows a partial, SDL-based specification
of call processing for a small (and simplified) telephone exchange. Part (a) of the
figure shows the block diagram and part (b) gives a fragment of the behavioral
specification for the Line Handler, one of the processes in the block diagram.

Part (a) shows that the specification consists of two major blocks. One contains
the LineHandler processes, which are responsible for the external behavior of the
exchange seen by individual phones. The other contains a resource manager process.
This process controls the sharing of exchange hardware resources needed to process
a call. One resource class may be the touchtone receivers, which decode the digit
from the tones sent by the phone when a key is pressed. For simplicity, Figure 2
shows only one resource manager process.

Part (b) states that when the telephone is idle and goes offhook, a request signal
for the resources needed to handle the origination will be sent to the Resource
Manager. The Resource Manager may grant the resources by sending a Grant signal to
the Line Handler, in which case dial tone is applied to the phone. If the resources
are not granted (signal Resource NotAvailable), the phone gets the fast busy tone.
If the phone goes onhook while its Line Handler is waiting for a response, the
OnHook signal is kept (Save'd) until a subsequent state.

Figure 2. Illustration of SDL Specifications

SDL specifications of external behavior are complemented by performance
specifications, which give the maximum permissible time intervals between an input
signal and the response(s) it triggers (for example, the time from OffHook to
DialTone). Furthermore, some external signals may be communicated only indirectly,
by a signal carrier. For example, OffHook and OnHook are encoded in changes of loop
current. The specification includes the definition of the minimum and maximum time
for which the signal carrier change must be present in order for the encoded signal
to be recognized.


Several issues must be considered in the development of the supervisor. One is
the incorporation of the specification of response times into the supervisor. A
major issue arises out of the nondeterminisms permissible under the specification
formalism used. SDL nondeterminisms fall into two major categories:

-  indeterminate delays in communication of signals over channels;

-  nondeterminisms in the specification of behavior of individual processes
(spontaneous transition NONE and nondeterministic path selection ANY [3]).

These nondeterminisms give rise to different but legitimate external behaviors. The
supervisor must be able to deal properly with such behavioral alternatives; it
should not have a preconceived idea about how the nondeterminism should be resolved
in the target system and consider any other alternative as failure.

The supervisor must also be able to properly handle uncertainties arising out of
the encoding of external signals in signal carriers. Over a short interval of time
(between the minimum and the maximum permissible signal recognition time), a state
change of the signal carrier may but need not be recognized as a valid signal.

3.  Supervisor  Strategies  for  Failure  Detection 

In  principle,  there  are  two  basic  strategies  for  supervisor-based  detection  of  failures 
of  reactive  systems  -  the  input-driven  and  the  output-driven.  These  two  strategies  are 
discussed  below. 

3.1.  Input-Driven  Failure  Detection 

In the input-driven strategy, when an input is observed, the supervisor
precomputes the possible system outputs triggered by it and stores them (Figure 3).
Because of specification nondeterminisms, there may be more than one legitimate
output. When an output from the target system is observed, the supervisor compares
it to those on the list. If a match is found, the supervisor removes from the list
the alternatives not pursued by the target system and updates the supervisor model
state. If no match is found, the supervisor concludes that a failure has occurred
and reports it. The supervisor then attempts to re-synchronize with the target
system so that it does not report the subsequent legitimate behavior of the target
system as failures.
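
The matching step can be illustrated with a deliberately small sketch for a single specification process. It is not the supervisor described in this paper: the table layout, the class name, the state names "Dialing" and "Blocked" and the crude re-synchronization are assumptions made for illustration only.

```python
class InputDrivenSupervisor:
    """Toy supervisor: `spec` maps (state, input) to {output: next_state}."""

    def __init__(self, spec, initial_state):
        self.spec = spec
        self.state = initial_state
        self.expected = {}      # legitimate pending outputs -> resulting state
        self.failures = []

    def on_input(self, signal):
        # Precompute every output the specification allows for this input.
        self.expected = dict(self.spec.get((self.state, signal), {}))

    def on_output(self, signal):
        if signal in self.expected:
            self.state = self.expected[signal]   # follow the alternative taken
            self.expected = {}                   # drop the ones not pursued
        else:
            self.failures.append((self.state, signal))
            self.expected = {}                   # crude re-synchronization


# Fragment modeled on the Line Handler: after OffHook either tone is legitimate.
spec = {("Idle", "OffHook"): {"DialTone": "Dialing", "FastBusyTone": "Blocked"}}
sup = InputDrivenSupervisor(spec, "Idle")
sup.on_input("OffHook")
sup.on_output("DialTone")     # legitimate: the resources were granted
sup.on_output("RingTone")     # no pending alternative explains this output
```

The second output is reported as a failure because no precomputed alternative matches it, while the first one only selects among the legitimate alternatives.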

To properly handle the nondeterminisms present in the specification model, the
supervisor must be able to consider several behavioral alternatives simultaneously.
The theory of beliefs has been developed for this purpose [9]. In this theory, a
separate thread of a specification process (a belief about its behavior) is created
to represent a behavioral alternative. In the case of SDL, a major source of
nondeterminism is the indeterminate propagation delay of signals over channels. In
the belief theory, when a process sends a signal over a channel, the destination
process is split into two threads. One thread represents the alternative that the
destination process has received the signal and the other that the signal is still
in transit. The former thread will process the signal and, if appropriate, produce
signals to other specification processes or to the external world. The latter thread
stores the signal in transit. This thread is needed to properly handle the case when
another process sends a signal to the destination process at about the same time.
Due to the indeterminate delays over channels, the second signal might actually
have arrived at the destination process earlier than the first. The
signal-in-transit thread is used to generate all possible signal arrival sequences
at the destination process. The threads representing consistent behavioral
alternatives of specification processes are linked into sets. Note that in the
scenario discussed, the two threads of the destination process stand for mutually
exclusive behavioral alternatives.

When an output from the system is observed, the behavioral alternatives (thread
sets) disproved by it are terminated and their constituent threads deleted.

If  the  specification  of  behavior  of  a  process  includes  a  nondeterministic  construct 
in  the  transition  being  executed,  a  separate  thread  must  be  created  for  each  possible 
transition  path.  As  before,  the  alternatives  invalidated  by  the  subsequent,  actually 
observed  external  behavior  are  terminated. 

Figure  4  presents  a  high  level  model  of  the  processing  involved  in  propagating  an 
input signal through D communicating processes before the output(s) it triggers are
produced.  The  small  rectangles  attached  to  processes  represent  the  signals  in  transit. 


Figure  4.  Processing  of  inputs  in  Input-Driven  Strategy 


The processing time requirements of the input-driven strategy can be estimated from
Figure 4. The processing time needed to pass a signal through a specification
process P is

$$T_p = t_c + N_f (t_t + t_s) \qquad (3.1)$$

where

$N_f$ = the mean forward nondeterminism factor (the number of transitions that may
potentially be executed as a result of a given incoming signal),
$t_t$ = the mean transition processing time,
$t_c$ = the mean interprocess communication and context switch time,
$t_s$ = the mean cost of creating a new thread (a clone) of a specification process
and storing its signal in transit.

The output signal(s) produced by P are going to be processed by $N_f$ specification
processes. (Note that the signal-in-transit thread does not process the signal; the
signal is merely stored.) This will repeat itself until a specification process is
reached which generates an external output signal. If D is the mean number of
processes the external input passes through before an external output is generated,
the total cost of processing an input signal can be approximated as

$$C_i = (1 + N_f + N_f^2 + \ldots + N_f^{D-1})\, T_p \qquad (3.2)$$

Note that the cost of matching (and of the ensuing termination of invalidated
threads) was not included in the above formula.

The additional memory needed for signal-in-transit threads can be estimated as

$$M = (1 + N_f + N_f^2 + \ldots + N_f^{D-1})\, M_t \qquad (3.3)$$

where $M_t$ is the memory required for a thread (including its input port).
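
To get a feeling for how quickly these estimates grow, the geometric series can be evaluated numerically. The sketch below assumes that both the cost and the memory estimate have the form (1 + N_f + ... + N_f^(D-1)) times a per-process quantity; the function names and all numeric values are illustrative and are not measurements from this paper.

```python
def levels(n_f, depth):
    """1 + N_f + N_f**2 + ... + N_f**(depth - 1)."""
    return sum(n_f ** k for k in range(depth))

def input_cost(n_f, depth, t_p):
    """Total processing cost of one input signal (geometric-series estimate)."""
    return levels(n_f, depth) * t_p

def transit_memory(n_f, depth, mem_per_thread):
    """Additional memory for signal-in-transit threads (same series assumed)."""
    return levels(n_f, depth) * mem_per_thread

# Illustrative values: T_p = 0.5 ms per process, 100 bytes per thread
for n_f in (1, 2, 3):
    cost = input_cost(n_f, depth=4, t_p=0.5)
    mem = transit_memory(n_f, depth=4, mem_per_thread=100)
    print(f"N_f={n_f}: {cost:5.1f} ms, {mem:5d} bytes")
```

With a nondeterminism factor of 1 the cost grows linearly in D; for larger factors it grows geometrically, which is why the factor matters more than the per-process cost.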

In the model considered, each process along the input signal propagation path had
only one thread. However, under some circumstances, more than one thread may
temporarily co-exist. This is, for example, the case with the resource manager
process of Figure 2, which is on several input-output paths. Consider the case when
a request signal Rp from process P is sent to a resource management process (M). Two
threads of M will co-exist for a brief time, until an external output is observed
which will cause one to terminate. If another process, Q, sends request Rq to M
before the termination occurs, five threads will have to be created reflecting all
signal arrival possibilities at M: RpRq, RqRp, Rp received and Rq in transit, Rq
received and Rp in transit, and both Rq and Rp in transit. In general, if r is the
number of requests to M the effects of which have not yet been confirmed through
external output, the number of threads of M is given in [10].

Even for small r, the number of additional threads may be large. As a
consequence, the processing costs and memory requirements in the input-driven
approach may be subject to sudden surges.
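
The growth can be made concrete by enumerating the arrival possibilities by brute force. The sketch assumes that each possibility is an ordered sequence of the requests already received, with the remaining requests still in transit. This reproduces the two cases given in the text (two threads for one request, five for two); whether it coincides with the closed form of [10] is an assumption.

```python
from itertools import permutations

def arrival_possibilities(requests):
    """All (received-in-order, still-in-transit) splits of the pending requests."""
    result = []
    for k in range(len(requests) + 1):
        for received in permutations(requests, k):
            in_transit = tuple(r for r in requests if r not in received)
            result.append((received, in_transit))
    return result

# Number of possibilities for r = 1 .. 5 pending requests
counts = [len(arrival_possibilities(list(range(r)))) for r in range(1, 6)]
print(counts)

two = arrival_possibilities(["Rp", "Rq"])
```

Under this assumption five pending requests already lead to several hundred alternatives, which illustrates the surges mentioned above.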

To detect response-time failures, the input-driven strategy may take advantage of
the form of response-time specifications, which are stated in terms of the maximum
time interval between a stimulus (external input) and a response (external output).
Two cases are possible. When the same specification process receives the stimulus
and produces the response, it is sufficient for it to set up a timer upon the
receipt of the stimulus. If the response arrives before the timer expires, the
timer is canceled. If not, the timer times out and a performance failure is
reported. This approach must be extended in cases when the response is generated by
a specification process different from the one that received the stimulus. Note
that the cost of setting up and canceling timers was not included in the above
formulas.
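
For the simple case in which one process sees both stimulus and response, the timer handling can be sketched as follows. Timestamps are passed in explicitly so that the example is deterministic; the class name and the limit of 3 seconds are hypothetical and not taken from any specification.

```python
class ResponseTimer:
    """Checks one stimulus/response pair against a maximum response time."""

    def __init__(self, max_response_s):
        self.max_response_s = max_response_s
        self.deadline = None

    def stimulus(self, now):
        self.deadline = now + self.max_response_s    # set up the timer

    def response(self, now):
        """Cancel the timer; return True if the response met the deadline."""
        on_time = self.deadline is not None and now <= self.deadline
        self.deadline = None
        return on_time

    def expired(self, now):
        """True if the timer is still running and has timed out."""
        return self.deadline is not None and now > self.deadline


timer = ResponseTimer(max_response_s=3.0)   # e.g. OffHook to DialTone (assumed limit)
timer.stimulus(now=10.0)
first_ok = timer.response(now=12.5)         # response within the interval

timer.stimulus(now=20.0)
late = timer.expired(now=23.5)              # no response yet: performance failure
```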

3.2. Output-Driven Failure Detection


The output-driven strategy is the opposite of the input-driven one. It is feasible
when the processing done in state transitions can be easily reversed. In this
strategy, the inputs to the target system are kept in a buffer (Figure 5). When an
output from the target system is observed, it is propagated backward through the
specification model. The input signal(s) that caused it are determined. The input
buffer is searched for the signal(s) expected. If a match is found, the supervisor
updates the state of the specification model and removes the input signals whose
effects have been fully accounted for from the input buffer. If there is no match,
the supervisor concludes that a failure must have occurred (there is no cause for
the output observed). Note that the backward tracing of an output signal may not
necessarily reach a specification process that takes input from the environment. It
may cease at an internal process which is in a state that cannot produce the needed
signal.
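
A single-process simplification of this strategy is sketched below. The real trace-back runs through several specification processes; here one lookup table stands in for the reversed transitions. The table layout, the class name and the state name "Dialing" are assumptions made for illustration.

```python
class OutputDrivenSupervisor:
    """Toy supervisor: `causes` maps (state, output) to (required input, next state)."""

    def __init__(self, causes, initial_state):
        self.causes = causes
        self.state = initial_state
        self.input_buffer = []      # observed inputs not yet accounted for
        self.failures = []

    def on_input(self, signal):
        self.input_buffer.append(signal)    # no work is done until an output appears

    def on_output(self, signal):
        cause = self.causes.get((self.state, signal))
        if cause is not None and cause[0] in self.input_buffer:
            required_input, next_state = cause
            self.input_buffer.remove(required_input)    # effect fully accounted for
            self.state = next_state
        else:
            self.failures.append((self.state, signal))  # output without a cause


causes = {("Idle", "DialTone"): ("OffHook", "Dialing")}
sup = OutputDrivenSupervisor(causes, "Idle")
sup.on_input("OffHook")
sup.on_output("DialTone")     # explained by the buffered OffHook
sup.on_output("DialTone")     # nothing in the model or the buffer explains this
```

The second DialTone is reported because the trace-back stops at a state that cannot produce it, which corresponds to the last case described above.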

Figure 6 presents a high level model of the processing involved. The model takes
into account the possibility that the signal traced might have been produced by
several transitions emanating from the current specification state in the sending
process and that there might have been several possible sources for the triggering
signal.

The processing time requirements of the output-driven strategy can be estimated
from Figure 6. The processing time needed to trace a signal backward through a
specification process P is

$$T_p = N_t t_t + N_t N_s t_s$$

where

$t_t$ = the mean cost of (backward) transition processing,
$t_s$ = the mean cost of backward signal propagation,
$N_t$ = the mean number of transitions in the current state of the specification
process that could have emitted the signal traced,
$N_s$ = the mean number of processes that could have emitted the triggering signal.

The number of specification processes that must be visited after the trace-back
through one specification process is $N_t N_s$. The overall cost of tracing back the
output produced through D process levels can then be expressed as

$$C_o = (1 + N_t N_s + (N_t N_s)^2 + \ldots + (N_t N_s)^{D-1})\, T_p$$

Figure 6. Processing of Outputs in Output-Driven Strategy

In  the  processing  model  of  Figure  6,  no  additional  memory  is  required  to  process 
an  output  signal. 

As observed earlier, a major advantage of the output-driven strategy with respect to
the input-driven one is that it does not have to directly enumerate all possible
behavioral alternatives. It waits to see which one will actually happen. However, the
specification nondeterminisms must be taken into account in explaining what has
happened. In particular, the indeterminate channel delays must be considered to
correctly explain the outputs observed. As an illustration, consider the scenario in which
phones A and B call phone C almost simultaneously. Assume that B is going to be
successful. If the ringing on phone C is the first output signal detected, it would be
incorrect for the supervisor to stop its search as soon as it discovers that A has dialed
C. For a brief interval of time, it has to consider both A and B. It is only when the
external outputs on phones A and B (i.e. busy and ring tone) are observed that the
supervisor may eliminate the alternatives invalidated. The theory of beliefs can handle
such scenarios by creating two Line Handler threads for each of A and B. However,
at least in the application domain considered, such scenarios appeared to be relatively
rare and the cost of thread creation was not included in the formulas given above.
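
The A/B/C scenario can be pictured as keeping a set of candidate explanations and
discarding those contradicted by later outputs. The sketch below is only an
illustration of that idea, not the supervisor's actual data structures: the
dictionary layout, the explanation labels and the `eliminate` helper are all
hypothetical.

```python
def eliminate(alternatives, observation):
    """Keep only the alternatives whose predicted outputs agree with the observation.

    alternatives : dict mapping an explanation label to its predicted
                   outputs, given as {line: tone}
    observation  : (line, tone) pair seen at the target system's outputs
    """
    line, tone = observation
    return {
        label: predicted
        for label, predicted in alternatives.items()
        # an alternative that predicts nothing for this line is not contradicted
        if predicted.get(line, tone) == tone
    }


# Two candidate explanations for "phone C rings" (hypothetical labels).
alternatives = {
    "A reached C": {"A": "ring tone", "B": "busy tone"},
    "B reached C": {"A": "busy tone", "B": "ring tone"},
}

if __name__ == "__main__":
    alternatives = eliminate(alternatives, ("C", "ringing"))   # both survive
    alternatives = eliminate(alternatives, ("A", "busy tone"))  # A is ruled out
    print(sorted(alternatives))
```

Both explanations survive the ringing on C; only the later tone observed on phone A
invalidates one of them, which mirrors the brief interval during which both A and B
must be considered.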

The detection of response time failures in an output-driven supervisor is rather
difficult if it is to be done in real time (i.e. as soon as the response interval expires). This
is because of the nature of the output-driven approach, in which the work is deferred
until the output (the response) appears. If it is imperative that the detection of such
failures be carried out in real time, it is usually necessary to separate the detection of
behavioral and response time failures and use a separate checker for the latter.

4. Illustration and Experience

A supervisor based on the ideas discussed in this paper was implemented for the
detection of failures in a small exchange. Real-time detection of failures was required for
automatic acquisition of failure data needed in the development and validation of
new software reliability prediction models [12]. The exchange and its telephones
were emulated on a Unix workstation. Programmable telephone traffic generators
were employed to generate random telephone traffic with the specified distributions.

The  specification  of  the  exchange  had  the  general  form  of  Figure  2.  Only  POTS 
calls  (plain,  ordinary  telephone  service)  were  supported.  The  exchange  served  60 
telephones. Call origination rates ranged from 8 to 15 originations/phone/hour. Instead
of monitoring the signals between the exchange and the telephones as shown in
Figure  1,  the  supervisor  was  monitoring  the  hardware  interface  memory  through 
which  the  exchange  control  program  sensed  and  controlled  the  exchange  hardware. 
This  eliminated  uncertainties  in  output  signal  recognition  (the  detection  of  output 
signals  did  not  suffer  from  signal  recognition  latencies),  but  it  left  input  signal  recog¬ 
nition  uncertainties  in  place. 

The  analysis  given  in  Section  3  was  used  to  evaluate  the  tradeoffs  involved  and  to 
select  the  supervision  strategy.  For  the  input-driven  strategy,  the  factor  was  1  and 
the processing cost was dominated by  D ranged from 1 to 3. For the output-driven
strategy, the $N_t$ factor was close to 1. However, for some output signals, the $N_s$ factor was
large. This was the case whenever more than one Line Handler was involved in the
backward propagation of outputs. For example, when ring tone to a phone is observed,
the tone must be traced back to the Line Handler for the called phone and from there
back again to the Line Handler for the caller. For these signals, $N_s$ is the number of
telephones served by the exchange. Even for a small exchange, $N_s^2$ is a very large
number. Although some heuristics could be built into the backward search, this
alternative was rejected out of concern about ending up with an ad hoc,
difficult-to-maintain supervisor.

Based  on  these  considerations,  the  input-driven  strategy  was  chosen  for  the  super¬ 
visor.  To  reduce  the  cost  of  implementation,  the  matcher  of  Figure  3  was  combined 
with  the  processes  that  produce  external  outputs  (i.e.  Line  Handlers).  For  example, 
the  bottom  half  of  the  FSM  of  Figure  2b  was  converted  into  the  segment  of  supervi¬ 
sor  Line  Handler  process  shown  in  Figure  7.  (To  reduce  the  size  of  this  figure,  the 
treatment  of  the  OnHook  signal  is  not  shown.) 

This figure contains two extensions to the standard SDL [11]. *0 stands for 'any
output from the target system pertaining to the line being supervised by this process
instance'. The crossed oval denotes the termination of the process thread. The thread
is terminated when an external output indicates that the behavioral alternative
represented by the thread has not been pursued by the target system. (This is the case, for
example, when a different tone is sent to the phone.)

Figure 7. Segment of LineHandler Supervisor

Note that such a combination of functionality is possible only if it can be guaranteed
that the external inputs are propagated through the supervisor model faster than
they propagate through the target system. This was the case with the emulated
exchange. Alternative approaches are available for non-emulated applications.

Subsequent experience with the purely input-driven version of the supervisor has
shown that, during some intervals of operation of the target system, the supervisor
ran out of memory. This turned out to be due to the rapid growth in the number of
threads of resource management processes. This occurred when the random
variations in telephone traffic resulted in a large number of almost simultaneous call
originations. In retrospect, this is not surprising in light of equation (3.4), but this point
was only realized at a later time. What made this phenomenon worse was the positive
feedback in its dynamics: the heavier the load on the exchange, the longer the
time interval between an OffHook and the response (dial tone or fast-busy tone) and
the larger the number of threads that coexist before they can be terminated.

To resolve this difficulty, it was noted that the processing of call originating
off-hooks typically results in dial tone output to the telephone. The product $N_t N_s$ for the
dial tone output signal was 1 in the exchange considered. This led to the decision
to introduce a degree of output-driven processing into the input-driven supervisor.
The output-driven processing applied only to the output signals that indicate what
the outcome of the resource request was. The boundary at which the input-driven and
output-driven processing for these signals (dial tone and fast-busy tone) met was moved into
the resource manager. The idea of combining the functionality of the input-driven
supervisor and the matcher was retained. In the implementation of the mixed strategy
supervisor, the originating OffHooks were propagated only to the Resource Manager. When
dial tone is observed on a phone, the Line Handler sends a notification to the
Resource Manager. As a consequence, the number of behavioral alternatives that had to
be considered for the Resource Manager became substantially smaller. The memory
overflows no longer occurred. The underlying theory is described in [10].

5.  Concluding  Remarks 

The  paper  considered  real-time  detection  of  failures  of  reactive  systems.  Failures  are 
detected  by  the  supervisor,  a  unit  that  monitors  the  inputs  and  outputs  from  the  target 
system.  The  supervisor  executes  a  model  obtained  from  the  specification  of  the  target 
system.  The  paper  dealt  with  the  case  when  the  target  system  is  specified  in  CCITT 
SDL,  a  language  based  on  communicating  extended  finite  state  machines.  The  focus 
was  on  event-driven  applications  such  as  telecom  switching. 

A major issue in specification-based detection of failures is the nondeterminism
intrinsic to the specification formalism. The supervisor should have no preconceived
idea about how the nondeterminisms should be resolved, and it should not treat any
other alternative as a failure. The paper briefly overviewed the theory of beliefs, which permits
the supervisor to keep track of simultaneous behavioral alternatives.

The paper discussed two basic strategies for real-time failure detection, the input-driven
and the output-driven one. In the former, when an input is observed, the supervisor
determines what may happen in the future at the system outputs. In the latter, the
supervisor tries to explain the system outputs from past inputs. The paper presented
formulas that estimate the processing and memory requirements for the two strategies.
The formulas were based on a high-level model of the processing involved and give
only a rough estimate of the quantities estimated. They are principally useful in
determining the tradeoffs involved and in the choice of the supervisor mode of operation.

The paper described an application of real-time failure detection to automatic
collection of call processing failure data in a small telephone exchange. The exchange
and its phones were emulated on a workstation. A purely input-driven strategy was
initially implemented. However, subsequent experience showed that this
implementation was subject to excessive surges in processing and memory requirements under
certain input scenarios. To gain insight, the models and formulas presented above
were developed. A hybrid approach based on partly output-driven processing of
certain output signals was implemented. This implementation no longer exhibited the
large surges in processing and memory requirements.
Abstract

To improve dependability, various voting schemes are implemented in computer
systems. The paper analyses the reliability and safety of elementary and composed
majority voting systems. A tool for the evaluation of such systems is also proposed.
It allows the choice of an architecture suitable for given reliability and safety
requirements.


1  Introduction 

To achieve higher reliability and fault tolerance of computer systems, high-quality
components and strict quality control procedures during the assembly phase can be
used, or some form of redundancy technique can be implemented [1]. Both of these
complementary techniques lead to an increase in system cost. This is the price which
we pay to satisfy the dependability requirements that are needed. Moreover, another
main issue that faces the designers of computer systems is to detect errors at the
same time as the real operations are performed. This means that a system does
not need to be stopped to find out which resources are faulty. To satisfy these
dependability and time requirements, majority voting schemes are implemented
in computer systems. This means that a system must be composed of at least three
nodes (modules) [2] which perform the same job and establish the
valid result by majority voting. In general, we have two types of elementary voting
schemes, which will be named Centralized and Distributed Voting Architectures or
briefly CVA and DVA, respectively. Moreover, compositions of these fundamental
schemes can create more complex (hierarchical) systems which satisfy the highest
dependability requirements.

In the literature, only centralized voting systems have been analyzed in detail
[1,3]. Presently the importance of distributed systems is growing rapidly, so the
decentralized voting strategies should be considered and compared. Some ideas
referring to hierarchical systems are given in [4], where some rollback recovery
strategies are analyzed. In [5] the matrix and channel voter based architectures are
considered. Note that the latter corresponds to the DVAs defined above. In the paper
we concentrate on the dependability analysis of the basic architectures. We also
propose a systematic approach to evaluate the reliability and safety of more
complex voting architectures that are compositions of such elementary systems.

System reliability R(t) is the probability of the correct system work (success)
during a certain period of time. System safety S(t) is the probability that the system
will survive for a certain period of time. To estimate these parameters, Markov
models are used [3]. Based on these models, reliability and safety of system nodes
are calculated in Section 2. Then elementary centralized and distributed voting
schemes are analyzed and compared in Sections 3 and 4, respectively. Section 5
introduces compositions of the elementary systems. The package program evaluating
the dependability of different voting systems is presented in Section 6 and its functions
are given and discussed.


2  Dependability  of  System  Nodes 


Let us consider a computer system that consists of n nodes (processing elements). Each
node can communicate with some other nodes by interconnect lines. Most studies
of the dependability of computer voting systems assume that a system remains operable
as long as there exists a suitable number of fault-free nodes. In consequence, the
dependability of the system nodes has a direct impact on the total system dependability.
In this section we concentrate on the dependability estimation of non-repairable and
repairable system nodes.


Fig. 1. Reliability models of: a) non-repairable node, b) partial self-repairable node,
c) self-direct repairable node, d) self-indirect repairable node
(legend: fault-free state, faulty state)


Let nodes consist of a processing unit, memory unit(s), and an input/output unit. In the case
of a non-repairable node, each fault of these units can cause the failure of the whole
node (see Fig. 1-a). Thus we can assume that:

$R_N(t) = e^{-\lambda t}$, where $\lambda = \lambda_p + \lambda_m + \lambda_{io}$    (1)

Moreover, it is highly unreasonable to assume that each single node should totally
fail. Therefore we consider three types of self-repairable nodes. It is assumed that
either a node can survive some faults (see Fig. 1-b) or a fully successful repair can take
place (Fig. 1-c,d). These models can be described by differential equations.
Using Laplace transforms, the problem is reduced from a set of differential equations
to a set of simultaneous linear equations. For the model given in Fig. 1-c, they are
as follows:

$s P_1(s) - 1 = -\lambda P_1(s) + \mu P_3(s)$

$s P_2(s) = \lambda P_1(s) - \epsilon P_2(s)$

$s P_3(s) = \epsilon P_2(s) - \mu P_3(s)$

where:

$P_1(0) + P_2(0) + P_3(0) = 1$, and $P_1(0) = 1$, $P_2(0) = P_3(0) = 0$.
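
As a worked step, the three linear equations for the model of Fig. 1-c (state 1 left at
rate $\lambda$, state 2 at rate $\epsilon$, state 3 at rate $\mu$) can be solved by
substitution. The derivation below is a sketch added for clarity; the shorthand
$Q(s)$ is introduced here and is not the paper's notation.

```latex
\begin{align*}
P_2(s) &= \frac{\lambda}{s+\epsilon}\,P_1(s), \qquad
P_3(s) = \frac{\epsilon}{s+\mu}\,P_2(s)
        = \frac{\lambda\epsilon}{(s+\epsilon)(s+\mu)}\,P_1(s), \\[4pt]
Q(s)   &= s^2 + (\lambda+\mu+\epsilon)\,s
          + (\lambda\mu+\lambda\epsilon+\mu\epsilon), \\[4pt]
P_1(s) &= \frac{(s+\epsilon)(s+\mu)}{s\,Q(s)}, \qquad
P_2(s) = \frac{\lambda\,(s+\mu)}{s\,Q(s)}, \qquad
P_3(s) = \frac{\lambda\epsilon}{s\,Q(s)}, \\[4pt]
P_1(s) &+ P_2(s) + P_3(s) = \frac{Q(s)}{s\,Q(s)} = \frac{1}{s}.
\end{align*}
```

The last line confirms that the state probabilities sum to one at all times. The
roots of $Q(s)$ are complex conjugates, $a \pm j\omega$, whenever
$\lambda^2+\mu^2+\epsilon^2 < 2(\lambda\mu+\lambda\epsilon+\mu\epsilon)$, which is the
case in which damped sinusoidal terms appear in the time domain.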


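Solving these equations we obtain expressions with variables $P_1(s)$, $P_2(s)$ and $P_3(s)$
which can be transformed directly to the time domain. Then we obtain:

$P_1(t) = A_1 e^{at} \sin(\omega t + \alpha_1) + k_1$    (2)

$P_2(t) = A_2 e^{at} \sin(\omega t + \alpha_2) + k_2$    (3)

where:

$\omega = 0.5\,[2(\lambda\mu + \epsilon\mu + \lambda\epsilon) - (\lambda^2 + \mu^2 + \epsilon^2)]^{1/2}$
$a = -0.5\,(\lambda + \mu + \epsilon)$
$g = \mu + \epsilon$
$d_1 = \mu\epsilon$
$d_2 = \mu$
$k_1 = d_1/(a^2 + \omega^2)$
$k_2 = \lambda d_2/(a^2 + \omega^2)$
$b = a^2 - \omega^2 + ag + d_1$
$A_1 = (1/\omega)\,[\{b^2 + \omega^2 (2a+g)^2\}/\{a^2 + \omega^2\}]^{1/2}$
$A_2 = (\lambda/\omega)\,[\{(a+d_2)^2 + \omega^2\}/\{a^2 + \omega^2\}]^{1/2}$
$\alpha_1 = \arctan[\omega(2a+g)/(a^2 - \omega^2 + ag + d_1)] - \arctan(\omega/a)$
$\alpha_2 = \arctan[\omega/(a+d_2)] - \arctan(\omega/a)$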

Then the reliability and safety of a node can be expressed as follows:
$R_N(t) = P_1(t)$ and $S_N(t) = 1 - P_2(t)$.
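
The closed-form expressions for the self-direct repairable node can be checked
numerically. The following is a minimal sketch, assuming the three-state model of
Fig. 1-c with rates $\lambda$ (state 1 to 2), $\epsilon$ (state 2 to 3) and $\mu$
(state 3 to 1) and the complex-root case. Amplitudes and phases are computed from
complex residues, which is equivalent to the arctan form but avoids quadrant
ambiguities. The function names are illustrative, and the Runge-Kutta integrator is
included only as an independent cross-check.

```python
import cmath
import math


def node_probabilities(lam, eps, mu, t):
    """Closed-form P1(t), P2(t) of the three-state node model (complex roots only)."""
    disc = 2 * (lam * mu + eps * mu + lam * eps) - (lam**2 + mu**2 + eps**2)
    if disc <= 0:
        raise ValueError("this closed form assumes complex conjugate roots")
    a = -0.5 * (lam + mu + eps)
    w = 0.5 * math.sqrt(disc)
    g, d1, d2 = mu + eps, mu * eps, mu
    root = complex(a, w)
    norm = a * a + w * w

    def oscillation(numerator):
        # amplitude and phase of the damped sinusoid from the residue at a + jw
        amplitude = abs(numerator) / (w * abs(root))
        phase = cmath.phase(numerator) - cmath.phase(root)
        return amplitude * math.exp(a * t) * math.sin(w * t + phase)

    p1 = oscillation(root * root + g * root + d1) + d1 / norm
    p2 = oscillation(lam * (root + d2)) + lam * d2 / norm
    return p1, p2


def integrate(lam, eps, mu, t_end, steps=2000):
    """Fourth-order Runge-Kutta solution of the same model, used as a cross-check."""
    def deriv(p):
        return [-lam * p[0] + mu * p[2],
                lam * p[0] - eps * p[1],
                eps * p[1] - mu * p[2]]

    p = [1.0, 0.0, 0.0]  # the node starts in the fault-free state
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = deriv([x + h * k for x, k in zip(p, k3)])
        p = [x + h * (c1 + 2 * c2 + 2 * c3 + c4) / 6
             for x, c1, c2, c3, c4 in zip(p, k1, k2, k3, k4)]
    return p


if __name__ == "__main__":
    p1, p2 = node_probabilities(1.0, 2.0, 1.5, 0.7)
    print("R_N =", p1, " S_N =", 1 - p2)
    print("numerical check:", integrate(1.0, 2.0, 1.5, 0.7)[:2])
```

The rates used in the example are arbitrary values chosen to satisfy the
complex-root condition; for other rate combinations the roots may be real and a
different closed form applies.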

Fig. 2 shows some graphs for models b and c given in Fig. 1.


3  Centralized  and  Distributed  Voting  Architectures 

CVA is the classical voting architecture (Fig. 3) where each computing node (CN)
is described by one of the reliability models shown in Fig. 1. The Centralized Voter
(CV) is made up of the Bus Interfacing Unit (BIU), which receives information from
and sends information to the CNs, and the Voting Unit (VU), which in turn performs a majority
voting algorithm to establish valid results. It is assumed that the voter must be a
hard-core unit to achieve reliable work of the whole system. Then the CVA may
tolerate a maximum of f faulty CNs, where $n \geq 2f+1$. A k-out-of-n system is a
generalization of the voting architecture in which k of n nodes must work correctly
to perform system functions [3].

We assume that the reliability of the CVA can be determined as a function of the
reliability of the CN, $R_{CN}(t)$, and the reliability of the CV, $R_{CV}(t)$.


Fig. 3. Centralized Voting Architecture (CVA)

Let us assume that all CNs are identical and have the same failure rate $\lambda_{CN}$, and
that the system can tolerate up to $f \leq n-k$ faulty CNs, where k is the minimum number
of CNs that must be operational for the system to be reliable. The structure-based
reliability of the system can be obtained by assuming a parallel-series
structural configuration. Then we define [1,3]:


$R_{CVA}(t) = R_{CV}(t) \sum_{i=0}^{n-k} \binom{n}{i} R_{CN}(t)^{n-i} (1 - R_{CN}(t))^{i}$    (4)

where:

$\lambda_{CV}$ - is the failure rate of the CV ($\lambda_{CV} = \lambda_{BIU} + \lambda_{VU}$)

The safety of the system $S_{CVA}(t)$ can be determined similarly as follows:

$S_{CVA}(t) = S_{CV}(t) \sum_{i=0}^{n-k} \binom{n}{i} S_{CN}(t)^{n-i} (1 - S_{CN}(t))^{i}$    (5)
Fig. 4. Distributed Voting Architecture (DVA)


DVA is (as shown in Fig. 4) composed of computing nodes (CNs) and redundant
buses. The nodes transmit data over the buses and each node can receive data from
all nodes including itself. The main difference between the CVA and DVA architectures
is the type of voting. The nodes of DVAs contain a BIU and a VU, which are used for
communication and distributed voting respectively. An example of such an architecture
is described in [5,6]. The BIU module arbitrates between two CNs when both of them
want to access the voter. It is also responsible for reconfiguring a faulty CN out of
the system. Each VU is programmed to perform voting, which is a more
complex operation than a simple comparison of the data received from all of the
nodes, and also to perform the error logging. When voting is performed, the final result
is transferred to each node by its respective BIU. Whenever data from a particular
node do not agree with the data from the other nodes, an error condition is latched
in each node which detects such an error. The latched information specifies the
faulty node as well as the source of the faulty data. No hard core is necessary
because the voting function is distributed among the nodes, and to eliminate f faulty
CNs, $n \geq 2f+1$. In the case of Byzantine faults, which are very common in
distributed voting, $n \geq 3f+1$ [7]. The reliability of the DVA is evaluated as follows:


$R_{DVA}(t) = \sum_{i=0}^{n-k} \binom{n}{i} R_{CN}(t)^{n-i} (1 - R_{CN}(t))^{i}$    (6)


The safety of the system $S_{DVA}(t)$ is determined in the same way as follows:

$S_{DVA}(t) = \sum_{i=0}^{n-k} \binom{n}{i} S_{CN}(t)^{n-i} (1 - S_{CN}(t))^{i}$    (7)
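
The k-out-of-n sums used for the CVA and DVA can be evaluated directly. The sketch
below assumes identical, independent nodes, with i counting the failed nodes from 0
to n-k, and treats the centralized voter as a series element. The function names are
illustrative and are not taken from the paper or from RASEP.

```python
from math import comb


def k_out_of_n(p, n, k):
    """Probability that at least k of n identical, independent nodes work.

    p is the probability that a single node works (reliability or safety
    at the time of interest); i counts the failed nodes.
    """
    return sum(comb(n, i) * p ** (n - i) * (1 - p) ** i for i in range(n - k + 1))


def cva_reliability(r_cn, r_cv, n, k):
    """Centralized voting: the voter is in series with the k-out-of-n group."""
    return r_cv * k_out_of_n(r_cn, n, k)


def dva_reliability(r_cn, n, k):
    """Distributed voting: no hard-core voter, only the k-out-of-n group."""
    return k_out_of_n(r_cn, n, k)


if __name__ == "__main__":
    # Illustrative values only: node reliability 0.9, voter reliability 0.99.
    print(cva_reliability(0.9, 0.99, 5, 3))
    print(dva_reliability(0.9, 5, 3))
```

In a DVA each node also contains a BIU and a VU, so the node reliability passed to
`dva_reliability` would normally be lower than the one used for the CVA; the example
values above ignore that difference.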


4  Comparison  of  CVAs  and  DVAs 

Let us consider the reliability of the architectures presented in Sections 2 and 3. We
assume that a computing system consists of homogeneous nodes. This means that all
reliability parameters are the same for each node. The reliability of the buses is described
by the reliability of the BIU. Fig. 5 plots the reliability curves for a system of n=5
(then k=3) and for two different schemes (CVA and DVA). We assume that the
reliability of the voting unit and bus interfacing unit are higher than the reliability
of other units, i.e., the failure rate of a CN = the failure rate of the MU < the
failure rate of the VU < the failure rate of the BIU.

We restrict our analysis to mission times of less than 5 years. The S-shaped curves
obtained are typical for redundant systems [1,3]. Above the knee, the CVA
and DVA have spare components that tolerate failures and keep the probability of
system success high. Once the system has exhausted its redundancy, however, there
is merely more hardware to fail. Because distributed voting systems have more
redundant components (except CNs), their reliability for the first period of time
(nearly one year) is higher, and then is lower, in comparison to the CVA. This
tendency is also true for the safety of the system.


5  Hierarchical  Compositions  of  Voting  Schemes 

Complex systems are typically structured hierarchically in a multi-level organization.
This means that such a system consists of smaller subsystems (each of which is
either a CVA or a DVA) and shared buses. We introduce a new class of voting
architectures named composed voting architectures. They are defined on the basis
of the composition operation [8]. The simplest composition is a cascaded series of
CVAs or DVAs [3]. Other examples are shown in Fig. 6. The scheme made up of
r distributed voting subsystems, where their signals are coupled by a centralized
voter, is named Centralized-Distributed Voting Architecture and is denoted by
CDVA. Another architecture, made up of a set of r centralized voting
architectures connected by redundant buses, is named Distributed-Centralized
Voting Architecture and is denoted by DCVA. We assume that such a complex system
works correctly if at least m of r subsystems are fault-free (m out of r).

It is easy to note that for the analysis of composed majority voting systems, we may
use the formulas given in Sections 2 and 3, provided the node reliability formulas
are replaced by the subsystem reliability formulas. Fig. 7 shows the curves for the
first two types of voting architectures discussed above and for model c of node
reliability. In general, DCVAs are more reliable than CDVAs. The utilization of
self-repairing nodes leads to a more significant increase in system reliability.


6  Reliability  and  Safety  Estimation  Package 

Below, the newly developed program named "RASEP" is described. RASEP
(Reliability And Safety Estimation Package) is dedicated to the analysis and comparison
of hierarchical voting architectures. The main modelling objective is to provide
estimates of the reliability and safety of complex computer systems.

Fig. 6. Examples of Composed Voting Architectures: a) CDVA, b) DCVA

Fig. 7. Reliability of Composed Voting Architectures [n=4 (k=3), r=3 (m=2)]
(curves of R(t) and S(t) for CDVA and DCVA; time axis in hours; failure rates
listed in the figure, per 1000 hours: 0.0015, 0.0008, 0.0010, 0.0005, 0.0002, 0.0002)

The prototype version of RASEP has been implemented in the C programming
language and is intended for the single-processor environment of an IBM PC AT
computer running under DOS. The system structure of RASEP is given in Fig. 8.
Presently, the program works with the reliability models of system units presented
in Fig. 1. However, new models can be added, because the choice of a given model
is indicated by a unique name. The hierarchical system architecture is described by the
following formula:


$X[\,Y(U_1),\ Y(U_2),\ \dots,\ Y(U_r)\,]$    (8)

Fig. 8. The Structure of RASEP


where:

X, Y - determine the type of architecture, e.g. CVA, DVA,

$U_i$, i = 1, ..., r - denotes either the kind of a system node or, recursively, a formula
like (8) describing the architecture of the subsystem,

r - is the number of elements; it can be different on each level of the description.

For example, the architectures of Fig. 6 can be described in the following way:

$CVA(DVA_1, DVA_2, \dots, DVA_r)$, $DVA(CVA_1, CVA_2, \dots, CVA_r)$.

It is easy to note that different types of architectures can be used on the same level,
e.g.:

$DVA(CVA_1, DVA_2(CVA_3, DVA_4))$.

Based on formula (8) and expressions (1)-(7), reliability and safety models
are generated and concrete metrics are determined for given parameters of failure
and repair rates. All figures presented in the paper were obtained with RASEP.
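
The recursive structure of formula (8) lends itself to a recursive evaluation. The
sketch below is not RASEP's implementation (which the paper states is written in C);
it is a hypothetical Python illustration. It assumes that an architecture is written
as a nested tuple `(kind, [elements])`, that a majority of the elements on each level
must work, and that every centralized voter has the same reliability. Because
different element types may appear on one level, the k-out-of-n probability is
computed for non-identical elements.

```python
def at_least(probs, m):
    """Probability that at least m of the independent elements work."""
    dist = [1.0]  # dist[j] = probability that exactly j elements work so far
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1 - p)      # this element has failed
            nxt[j + 1] += q * p        # this element works
        dist = nxt
    return sum(dist[m:])


def evaluate(arch, node_reliability, voter_reliability):
    """Reliability of an architecture given as a node name or (kind, [elements])."""
    if isinstance(arch, str):
        return node_reliability[arch]          # leaf: a kind of system node
    kind, elements = arch
    probs = [evaluate(e, node_reliability, voter_reliability) for e in elements]
    majority = len(probs) // 2 + 1             # assumed voting threshold
    group = at_least(probs, majority)
    # a centralized level adds its voter in series; a distributed level does not
    return group * voter_reliability if kind == "CVA" else group


if __name__ == "__main__":
    nodes = {"n": 0.9}                         # illustrative node reliability
    dva = ("DVA", ["n", "n", "n"])
    cdva = ("CVA", [dva, dva, dva])            # centralized voter over three DVAs
    print(evaluate(cdva, nodes, 0.99))
```

The same function evaluates safety if safety values are supplied in place of
reliabilities, following the remark that node formulas are replaced by subsystem
formulas at each level.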


7  Conclusions 

There are some real computing systems where majority voting schemes are
implemented [1,5,6,7]. The aim of this paper is to draw attention to a new possibility
of generating various voting architectures and to show some methods of their
evaluation. In order to support the modelling and evaluation of those architectures, a
program "RASEP" has been built. Using this package, for given reliability
requirements, we may choose a suitable architecture. Presently, the package is still
under development. The fault coverage and error latency will be included in the
program as well.
Abstract

As an alternative to the classical approach for system specification on the
basis of a formalised general purpose language, a graphical and
specialised language for application to safety critical systems is outlined.
The architecture of the language is constructed in accordance with the
functional and timing requirements typical of operation in safety
systems. The fundamental and generic elements of the language are
presented: the syntax and semantics of function and net diagrams. A wide
range of operational behaviour (functional and timing) can be determined
by this graphical specification technique, and several ways of specification
analysis are opened. Some examples show how to benefit from the
combination of illustrative graphical demonstration and strictly defined
rules for their interpretation.


1  Introduction:  Universal  versus  Special  Language 

A main task in software technology is the computer based and (as far as possible)
automatic development of complex software systems. The essential basis for that is
settled in the early stages of the development process: the system's work has to be
specified in a way the computer can understand and operate with. For that reason and in
order to avoid severe misunderstandings, the elimination of which often requires
enormous efforts, a precise formulation of the intended system and its design is
required. Formal methods for system definition have been suggested which fulfil these
requirements to some extent.

The classical approach for a high-level system specification is the formal language
representation of its functionality. There are several language concepts [1,2,3], almost
all based on the data type description of the system properties: the idea is that a data
type is not just a definition or enumeration of its admissible values, but the concept of
types also comprises all operations that are meaningful for these data objects. The
way to the system's behaviour is opened by axiomatic rules regulating the relationship
between values and the admissible state transfers. The system properties result from
rewriting sequences according to the stated rules and according to algebraic
transformation principles [4].

Most of these languages do not have the feature to project operational behaviour with
respect to time constraints and synchronisation of different processes which constitute
the total system. Another difficulty with these formal languages arises because of their
universality: to cover a wide range of applications, the vocabulary of these languages
has to be very elementary (set-theoretic notation and predicate calculus terminology),
and a system description normally consists of a complex set of relational rules. Because
only the relational structure of a system is formulated, even for relatively simple
systems the consequences of the stated rules and the final behaviour cannot be realised
immediately. Therefore, for proving that the specified system meets the original
requirements, extensive verification procedures have to be carried out [4].

The situation changes if the universality principle for the specification language is
dropped and a formalisation for a restricted application area is taken into consideration:
the language vocabulary for a special technical field can be adjusted to the particular
subjects, and the combination rules (the grammar) can be arranged according to the
requirements of the special field.

The safety system in nuclear power plants, for example (or similar plant protection
systems), fulfils the prerequisites of a structurally and conceptually restricted
operational technique. The input quantities are regulated and the operational logic for
the safety functions is constituted from elementary functional units, because the safety
functions follow simple operational patterns: data acquisition and preparation,
accident control by evaluation and comparison of measurements, and a few normed
reaction schemes. Usually this operational procedure runs simultaneously and
redundantly on different computers and is cyclically repeated. Therefore, a
synchronisation mechanism has to be implemented and timing constraints have to be
considered.

Along this operational paradigm, a graphical language for the specification and design of
this type of safety system is outlined in the next sections. For that, the proposals made
in [5,6] for designing a graphical, high-level language are taken up and modified for
special applications. In section 2 the architectural concept of the language is discussed:
the two essential constituents of the language, functional and net constructors, are
established. In section 3 a more detailed description of these two language features is
given: the combination of functional units to function diagrams is defined, and a
special class of time Petri nets is introduced for managing synchronisation and
regarding timing aspects. In section 4 the system's analysis at this very early stage of
the development and the possibilities of a direct implementation of the graphically
specified system are discussed. The main results of the report are summarised and a short
outlook on the future work is given.

2 Global Architecture of a High-level Language for Safety Systems

A language for the specification of control and safety systems should be designed in
accordance with the operational and behavioural patterns of these systems. The
language features have to be adapted to describe the functional and performance
specialities: The vocabulary and the rules for its combination have to reflect the
properties of the systems.

For that purpose, the basic functional elements for safety systems have to be isolated,
the basic arrangements for building up more complex functions have to be identified, and
the main rules for integration and synchronisation of the usually distributed computations
have to be fixed.

The main characteristics of a typical safety system are:

- Mostly simple functional lines, starting with data acquisition, data examination and
plausibility checks, and then performing data comparison and evaluation and a few
reaction schemes

- Standardised Boolean or arithmetical operations on the data values (e.g. selection,
sort, threshold comparison)

- A straightforward execution of control functions with few branches and careful
looping

- Execution of the same or of similar functionality on different computers (caused by
the requirements of redundancy/diversity)

- Rigorous requirements for integration and synchronisation of the distributed
computer system

- Strict timing requirements and cyclical execution routines.

The specification language has to take care of these aspects: It should facilitate the
formulation of linear functionality and provide constructs for the determination of
timing and performance constraints. In correspondence to these demands the language
includes two graphical forms:

The function form, represented in functional diagrams (see section 3.1).

The functional vocabulary consists of a set of predefined units (bricks) at the designer’s
disposal, enabling the construction of a large class of more complex control functions. In
addition to that, the open character of the language allows the integration of new, user
defined units for special applications.

The net form, represented in net diagrams (see section 3.2),
is used for the arrangement and management of distributed functionality and for the
expression of synchronisation and timing requirements.

With these language constructors a program for a distributed safety system is built up:
For the different computers their tasks are formulated in separate function diagrams.
Depending on the complexity of these functions the functional specification can be
divided into several hierarchically ordered diagrams. The co-operation of the
computers is co-ordinated by net diagrams which manage the right ordering and timing
of the global computation. From within the net process other functions may be called and a
cyclical execution procedure may be established.


3  Basic  Elements  of  a  High-level  Language 

3.1  The  Function  Form 

The vocabulary of a safety language is a collection of simple functional units, from
which the more complex safety functions can be built up. For the definition of the
proper words of the specification language one has to make a compromise between
simplicity and complexity: The items should not be too elementary because then the
more complex constructs lose their transparency; on the other hand they should not be
too extensive and should not contain too much information because then the language
loses its flexibility and power.

A  language  for  safety  systems  in  nuclear  power  plants  has  to  provide  items 

-  for  manipulation  of  safety  indicators,  for  example 

•  simple  Boolean  operators 

•  simple  numerical  evaluators  (threshold  comparison,  sort  routines,  max/min 
checkers) 

-  for  special  technical  features,  as  they  may  be 

•  particular  controllers 

•  drivers  for  control  equipment 

-  for  operational  strategies,  as  they  are  required  for 

•  process  synchronisation  and  communication 

• process priority and access regulation to commonly used resources.

The single words of the language are introduced as graphical symbols, the combination
of which is ruled by the inherent properties of these functional units (number and type of
input-output relations). The semantics of the language are given by small generic
algorithms or models connected to the single words (see below); the language power is
determined by these functional bricks and their combination effects.

To specify a function, the input-output relation of that function has to be analyzed and
divided into smaller functional blocks (depending on the complexity of the system),
revealing the constructive elements of the specific function. Then the functional
diagram is built up using the functional units of the language as constituents and
connecting them in correspondence to the structure of the function.

This analysis and partition procedure aims at the reconstitution of the function from
well-known and predefined, indivisible functional units. The verification process later
on is based on the complete verification of these constituents and follows the
composition rules used for the construction of the more complex function. (To cover a
larger range of applications, it is possible to extend the basic vocabulary by new
generic, user defined elements.)

A small example (data acquisition and check) illustrates the construction process
for functional diagrams:

The values of two signals (received from different sensors) have to be checked for
plausibility. Only if both values are recognised as valid (not less than a fixed
threshold a and not larger than b) is the larger one selected for further evaluation;
otherwise the failure of the data acquisition process has to be announced.

A specification of this acquisition process is presented in the function diagram of figure
1 (where $\omega$ stands for invalid value/failure in the acquisition process).


[Function diagram not reproduced. Legend of the units used:]

NUM: range of possible values

$NUM_\omega := NUM \cup \{\omega\}$

$BOOL := \{T, F\}$

$\wedge : BOOL \times BOOL \to BOOL$, the usual Boolean "AND"

Selection unit $: BOOL \times NUM \times NUM \to NUM_\omega$,
$(b, x_1, x_2) \mapsto \max\{x_1, x_2\}$ if $b = T$, $\omega$ otherwise

Figure 1: Function diagram example
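
The acquisition example can also be written down as ordinary code. The following is only a sketch under stated assumptions, not part of the graphical language: the names `plausible`, `select_max` and `acquire` are hypothetical, and `None` is assumed to play the role of the invalid value $\omega$.

```python
from typing import Optional

OMEGA = None  # assumption: None stands for the invalid value (omega in the text)


def plausible(x: float, a: float, b: float) -> bool:
    """Threshold comparison unit: x is valid if it lies in [a, b]."""
    return a <= x <= b


def select_max(flag: bool, x1: float, x2: float) -> Optional[float]:
    """Selection unit: the larger value if flag is true, otherwise OMEGA."""
    return max(x1, x2) if flag else OMEGA


def acquire(x1: float, x2: float, a: float, b: float) -> Optional[float]:
    """Wire the units together as described in the acquisition example."""
    both_valid = plausible(x1, a, b) and plausible(x2, a, b)  # Boolean AND unit
    return select_max(both_valid, x1, x2)
```

Each function corresponds to one functional unit, so a proof of the whole function reduces to the proofs of the three small units plus their wiring.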

3.2 The Net Form


In general, the function diagrams are distributed on different processors or computers
and executed independently (according to the requirement of redundant/diverse
measurement acquisition and evaluation in safety critical systems). At special places
the results of these calculations have to be assembled and compared, and an adequate
reaction has to be triggered.

For the purpose of combination and synchronisation of originally separated activities,
special language features have to be prepared enabling the integration and organisation
of different computational procedures. So, there is a need for language constructs which
allow one to express in a formal and unambiguous way

- which procedures (functional diagrams) should be combined, integrated and
worked up together

- where and how a synchronisation of the independently operating units has to be
achieved

- what timing requirements have to be fulfilled (duration of calculation, cyclical
constraints)

- when and how interrupts are allowed and how to continue afterwards

To meet the requirements of assembling individual procedures and of timing
constraints a special feature is introduced into the language: the net form. The syntax
and semantics of these net constructs are borrowed from the well-known Petri net
models, adding timing facilities and modifying the rules for the dynamical behaviour of
the classical Petri nets. These extended Petri nets, so-called time Petri nets, allow the
exact specification of action sequencing in accordance with logical and timing
constraints (see below).

The  arrangement  and  combination  of  function  and  net  forms  is  roughly  sketched  in 
figure  2. 

For synchronisation and combination of the functions’ results the outputs of the different
function diagrams are collected in a net diagram. The actual initiation of the net (leading
to an initial state, see example in figure 3 below) is caused by these function outputs. To
the single transitions of the net there may be allocated "actions" (other function
diagrams) which are executed when the transition fires and which also may use
output values of connected function diagrams (dashed line in figure 2). The sequencing
of the transitions’ firing, i.e. the development of the system’s activities, is determined by
the structure of the net.

In the following, time Petri nets (TPNs) and their use within the language are described
in detail. The formal definition of time Petri nets is given according to [7,8]. For
treating the special requirements stated above some essential modifications to the firing
rules and to the net behaviour are introduced.

[Figure 2; panels: function diagrams, net diagram]

Figure 2: Integration of function and net diagrams

A TPN is a tuple $(P, T(A), F, M_0, SIM, DIM)$, where

• $P$ is a finite set of places, symbolized as circles

• $T(A)$ is a finite set of transitions with possible actions, symbolized as rectangles
with action description

• $F \subseteq P \times T \cup T \times P$ is a relation between places and transitions,
symbolized as arcs

• $M_0 : P \to \mathbb{N}$ is an initial marking, symbolized as black tokens on the
corresponding net places

• $SIM : T \to$ (time) intervals

• $DIM : T \to$ (time) intervals

The SIM(t)-interval $(\alpha_t, \beta_t)$ characterizes the time delay for transition firing and may
be used for the specification of synchronisation and time sequencing of transitions (and
actions). The time values $\alpha_t$ and $\beta_t$ are relative time indicators, relative to the
moment at which the corresponding transition is enabled: Assume that transition t is
enabled at absolute time $\tau_{abs}$; then t may fire as soon as possible, but not before
$\tau_{abs} + \alpha_t$ and not later than $\tau_{abs} + \beta_t$. These values are dynamically updated in
correspondence to the behaviour of the net (see firing rule below).

If there is an action attached to a specific transition and its duration cannot be neglected,
the DIM(t)-interval characterizes the minimal/maximal duration of this action. By that
duration value the net activities and the general behaviour may be influenced to some
extent (see firing rules below).
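
To make the tuple concrete, the following sketch shows one possible in-memory representation of a TPN. It is an illustration only: the class and method names (`Interval`, `TimePetriNet`, `preset`, `enabled`, `firing_window`) and the two-place example net are assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float  # may be float("inf") for an unbounded interval

    def __post_init__(self):
        if self.lo < 0 or self.hi < self.lo:
            raise ValueError("interval bounds must satisfy 0 <= lo <= hi")


@dataclass
class TimePetriNet:
    places: set          # P
    transitions: set     # T
    arcs: set            # F: pairs (place, transition) or (transition, place)
    m0: dict             # initial marking M0: place -> number of tokens
    sim: dict            # SIM: transition -> Interval (firing delay)
    dim: dict = field(default_factory=dict)  # DIM: transition -> Interval (action duration)

    def preset(self, t):
        """Input places of transition t."""
        return {p for (p, x) in self.arcs if x == t}

    def enabled(self, marking, t):
        """t is enabled if every input place carries at least one token."""
        return all(marking.get(p, 0) >= 1 for p in self.preset(t))

    def firing_window(self, t, enabled_at):
        """Absolute earliest and latest firing time of t, enabled at enabled_at."""
        iv = self.sim[t]
        return (enabled_at + iv.lo, enabled_at + iv.hi)


# Hypothetical example: one transition moving a token from p1 to p2.
net = TimePetriNet(
    places={"p1", "p2"},
    transitions={"t1"},
    arcs={("p1", "t1"), ("t1", "p2")},
    m0={"p1": 1, "p2": 0},
    sim={"t1": Interval(2.0, 4.0)},
)
```

With this net, a transition `t1` enabled at absolute time 10 may fire between times 12 and 14, which is what `net.firing_window("t1", 10.0)` computes.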

Whereas in ordinary Petri nets the behaviour is ruled only by the markings, in TPNs
additional conditions concerning the time constraints have to be fulfilled for
enabledness and firing of transitions. These conditions constitute, together with the
usual marking, the state of the net and determine the next possible activity. In general, a
state S of a TPN consists of a pair

S = (M, I), where

• M is a marking of the net

• I, the firing domain, is the set of enabled transitions (by marking M) together with
the corresponding firing intervals (the number of entries of I will vary during
the behaviour of the net according to the number of transitions enabled by the
current marking).

Now, let us assume the current state of the TPN to be S = (M, I). The transitions in I are
enabled by the marking M, but not all of them are allowed to fire immediately due to the
timing constraints (of the firing intervals). The "firability condition" for a transition
contains a time parameter and expresses the fact that an enabled transition may not fire
before the left bound of its time interval and must not fire after the right bound
(unless another transition fires before and modifies marking M and so state S).

Accordingly, transition t is firable from state S = (M, I) at time $\tau_{abs} + \tau$ iff

i) t is enabled by marking M at time $\tau_{abs}$

ii) the relative time $\tau$ is within the actual time interval for t

The firing rule itself is introduced by the definition of the follower state S’ = (M’, I’)
reached by firing an enabled transition t at time $\tau_{abs} + \tau$:

i) M’ is computed according to the firing rule in ordinary Petri nets.

ii) The new firing domain I’ is computed from I in three steps:

• All transitions which are disabled when t is fired are removed from I

• For a transition which remains enabled by M’ the new firing interval is
calculated as

$\alpha' = \max\{0, \alpha - \tau - d\}$

$\beta' = \beta - \tau - d$

(for $\beta' < 0$: SIM(t))

• For a newly enabled transition t introduce as firing interval SIM(t)

In other terms, the first step corresponds to projecting the firing domain to the
transitions that remain enabled after t has fired. In the second step the time consumed
by the firing delay (and, possibly, the duration of an action connected to t) is
taken into account: Time is incremented by the value $\tau + d$, which appears as an
interval shifting operation. If the time interval has completely passed ($\beta' < 0$), the
transition interval is set to the original delay SIM(t). In the third step the domain of
the new state is supplemented by the original firing intervals of the newly enabled
transitions.
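
The three steps can be sketched in code as follows. This is a minimal illustration under explicit assumptions: the arc structure in `PRE` and `POST` is a hypothetical net loosely modelled on the interrupt example of this section, the value of `THETA` is arbitrary, and the fired transition itself is treated as newly enabled if the new marking still enables it.

```python
from math import inf

# Hypothetical arc structure (assumption, not read from a figure).
PRE = {"t1": ["p1"], "t2": ["p2", "p3"], "t3": ["p1"], "t4": ["p2"]}
POST = {"t1": [], "t2": [], "t3": [], "t4": []}
THETA = 5.0
SIM = {"t1": (THETA, 2 * THETA), "t2": (0.0, 0.0),
       "t3": (2 * THETA, inf), "t4": (2 * THETA, inf)}


def enabled(marking, t):
    return all(marking.get(p, 0) >= 1 for p in PRE[t])


def fire(marking, domain, t, tau, d=0.0):
    """Return the follower state (M', I') after firing t at relative time tau.

    d is the duration of the action attached to t (0 if there is none).
    """
    if t not in domain:
        raise ValueError(f"{t} is not enabled")
    alpha, beta = domain[t]
    if not alpha <= tau <= beta:
        raise ValueError(f"{t} is not firable at relative time {tau}")

    # i) marking update as in ordinary Petri nets
    new_marking = dict(marking)
    for p in PRE[t]:
        new_marking[p] = new_marking.get(p, 0) - 1
    for p in POST[t]:
        new_marking[p] = new_marking.get(p, 0) + 1

    # ii) new firing domain
    new_domain = {}
    for u in SIM:
        if not enabled(new_marking, u):
            continue                      # step 1: disabled transitions drop out
        if u in domain and u != t:        # step 2: shift intervals by tau + d
            a, b = domain[u]
            a2, b2 = max(0.0, a - tau - d), b - tau - d
            new_domain[u] = SIM[u] if b2 < 0 else (a2, b2)
        else:                             # step 3: newly enabled transitions
            new_domain[u] = SIM[u]
    return new_marking, new_domain
```

Firing `t2` with a short action shifts the interval of `t1` to the left, while a long action (longer than `2 * THETA`) makes that interval expire, so that it is reset to `SIM["t1"]` and the lower bound of `t3` drops to zero.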

With the rule "firing transition t from state S at time $\tau$" a relation on the set of
possible states

STATES = { (M, I) : M marking, I firing domain }

is introduced and is denoted as

$S \xrightarrow{(t,\tau)} S'$

A firing schedule, that is a sequence of pairs $(t_1, \tau_1), \ldots, (t_n, \tau_n)$ ($t_i$
transitions, $\tau_i$ times), is feasible from a state S iff there exist states
$S_1, S_2, \ldots, S_n$ such that:

$S \xrightarrow{(t_1,\tau_1)} S_1 \xrightarrow{(t_2,\tau_2)} S_2 \ldots \xrightarrow{(t_n,\tau_n)} S_n$

A state S’ is reachable from state S within time T’ iff there is a firing schedule
$(t_1, \tau_1), \ldots, (t_n, \tau_n)$ leading from S to S’ and $\sum_{i=1}^{n} \tau_i < T'$.

Therefore, the firing rule permits one to compute states and a reachability relation
among them. The set of states that are reachable from the initial state or the set of firing
schedules feasible from the initial state characterize the behaviour of the TPN, in the
same way as the set of reachable markings and the firing sequences characterize the
behaviour of ordinary Petri nets.
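
Checking a given firing schedule amounts to applying the firing rule step by step. The sketch below is generic over a `step` function; the names `run_schedule`, `reaches_within`, `toy_step` and the transition table are hypothetical and stand in for a real firing rule. Note that it only checks one given schedule, whereas reachability in the sense defined above asks whether some such schedule exists.

```python
def run_schedule(step, state, schedule):
    """Return (final_state, elapsed) if the schedule is feasible, else None.

    step(state, t, tau) must return the follower state, or None if t is not
    firable from state at relative time tau.
    """
    elapsed = 0.0
    for t, tau in schedule:
        state = step(state, t, tau)
        if state is None:
            return None
        elapsed += tau
    return state, elapsed


def reaches_within(step, start, target, schedule, limit):
    """True if the schedule leads from start to target in less than limit."""
    result = run_schedule(step, start, schedule)
    return result is not None and result[0] == target and result[1] < limit


# Toy transition table (hypothetical): (state, transition) -> ((lo, hi), follower)
TABLE = {
    ("S0", "t1"): ((2.0, 4.0), "S1"),
    ("S1", "t2"): ((0.0, 1.0), "S2"),
}


def toy_step(state, t, tau):
    entry = TABLE.get((state, t))
    if entry is None:
        return None
    (lo, hi), follower = entry
    return follower if lo <= tau <= hi else None
```

In this toy table the schedule `[("t1", 3.0), ("t2", 0.5)]` is feasible from `"S0"`, whereas `[("t1", 1.0)]` is not, because the relative time 1.0 lies before the left bound of the firing interval.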

Let us consider a small example in order to illustrate the firing conditions
and the behaviour of a TPN (see figure 3a):

The net is initiated providing a certain value and allowing an interruption demand (the
net is marked at the places 1 and 2 from an outside function). The further treatment of
the value is delayed for some time ($\theta$, starting at the moment the net is initiated).

If no interrupt is demanded within this time period $\theta$, the transition 1 will fire after
that time.

If an interrupt is demanded within this time period (a token appears at place 3), the
interrupt transition 2 is fired and a possibly connected action is started immediately. In
this (interrupt) case the ongoing behaviour of the net depends on the duration of the
interrupt action:

If the interrupt action is finished before $2\theta$, the transition 1 is still enabled and will
fire; otherwise the token is removed from place 1 by transition 3 and the execution
(transition 1) is omitted.

In figure 3b the states are precisely noted and their relational combination is graphically
presented.


Places                         Transitions

1: value available             1: execution
2: ready for interrupt         2: interrupt handling
3: interrupt indicator         3, 4: deletion

States (marking, then enabled transitions with their firing intervals):

Marking (1,1,0):  t1: $[\theta, 2\theta)$,  t3, t4: $[2\theta, \infty)$

Marking (1,1,1):  t1: $[\theta-\tau, 2\theta-\tau)$,  t2: $[0, 0]$,  t3, t4: $[2\theta-\tau, \infty)$

Marking (1,0,0):  t1: $[\max\{0, \theta-\tau-d\}, 2\theta-\tau-d)$,  t3: $[2\theta-\tau-d, \infty)$

Marking (1,0,0):  t1: $[\theta, 2\theta)$,  t3: $[\max\{0, 2\theta-\tau-d\}, \infty) = [0, \infty)$

Figure 3: Interrupt example (a: state graph, b: net specification)

In the following a more sophisticated example gives an impression of the
facilities for synchronisation and memory management offered by the time Petri net
integration of distributed systems:

Two values, received from different computers (output of different function diagrams),
have to be compared exactly at the time points $\theta, 2\theta, 3\theta, \ldots$ (cyclical
synchronisation). If the comparison indicates a critical situation, an exception handler
has to be started; if the comparison indicates normal behaviour, the cyclical execution is
started again. The exception handler also has to be called if one of the values (or both)
is not available in time twice (in subsequent runs).

The net specification of this example is shown in figure 4. There, the synchronisation is
established by the transition 5, which fires exactly at the moments $\theta, 2\theta, 3\theta, \ldots$
After that firing the evaluation transition 4 is enabled for firing immediately, if both of
the values are available (places 1 and 2 have been marked in the meantime). If only one of
the values is missing at that moment, transition 6 (permanently enabled) gets the priority
for firing: In that case it depends on the memory token how the net form is exited (token at
place 9: there was no missing value before; token at place 10: there was already a
missing value in the last computation cycle).


Figure 4: Synchronisation and evaluation example


4 Conclusions: Verification and Implementation of the Safety Language


This paper presented an approach to formally specify software systems in a graphical
language, as an alternative to the classical formal approach where the relational
structure of a system is given in elementary mathematical terms. The architectural
concept for designing the graphical language by combining the functional form of
representation with the net form (time Petri nets) has been described in some
detail. This composition feature allows the description of the functional evaluations as
well as the consideration of the operational behaviour of the distributed computation
(concurrency, synchronisation, timing constraints).

Function forms are built up by the combination of a fixed set of basic functional units
using only simple composition rules. Some investigations revealed that about 50 units
(most of them relatively short modules) are sufficient for the specification of the large
majority of the automation problems occurring in nuclear safety systems [9]. Because of
the limited size of the basic units it should be possible to prove their correctness
formally, using for example the rules of the Hoare calculus [10]. For the verification
of complete function diagrams one has to consider the following:

- Loops and conditional branches are packed into the single functional units, which can
be proven separately and in advance.

- The function diagrams themselves are constructed without any backward arcs.

With these prerequisites a function diagram can in general be translated into a
(semantically) equivalent computation sequence consisting of the single functional
constituents (linear ordering of the functional units). Provided that the individual
units are proven to be correct, the verification process for function diagrams is limited
to checking the adequate transformation of the results along a linear
computational sequence.
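
The translation of a diagram without backward arcs into a linear computation sequence is a topological ordering. The sketch below illustrates this with Python's standard `graphlib` module (Python 3.9 or later); the diagram, the unit names and the thresholds `A` and `B` are hypothetical and merely mimic the acquisition example of section 3.1.

```python
from graphlib import CycleError, TopologicalSorter


def linearise(diagram):
    """Return the nodes in an order where every unit follows its inputs.

    diagram maps each unit to the ordered list of nodes feeding its inputs.
    """
    return list(TopologicalSorter(diagram).static_order())


def is_acyclic(diagram):
    """A diagram with a backward arc cannot be linearised."""
    try:
        linearise(diagram)
        return True
    except CycleError:
        return False


def evaluate(diagram, units, inputs):
    """Evaluate the units one after another along the linear sequence."""
    values = dict(inputs)
    for name in linearise(diagram):
        if name in diagram:  # external inputs are already in values
            values[name] = units[name](*(values[src] for src in diagram[name]))
    return values


A, B = 0.0, 10.0  # hypothetical plausibility thresholds
DIAGRAM = {
    "valid1": ["x1"],
    "valid2": ["x2"],
    "both": ["valid1", "valid2"],
    "out": ["both", "x1", "x2"],
}
UNITS = {
    "valid1": lambda x: A <= x <= B,
    "valid2": lambda x: A <= x <= B,
    "both": lambda p, q: p and q,
    "out": lambda ok, x1, x2: max(x1, x2) if ok else None,
}
```

Because every unit is evaluated exactly once and only after its inputs, checking the diagram reduces to checking how results are passed along this sequence.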

The net form enforces a careful construction of the system’s concurrent parts and
demands an explicitly formulated schedule for the temporal patterns. Using and
extending classical Petri nets or place-transition nets by adding time features allows
one to simultaneously model the behaviour and analyze the properties of timed systems.
This technique is related to the reachability analysis method for ordinary Petri nets: The
firing rule, modified with a timing condition, permits one to compute states and a
reachability relation among them. The set of states that are reachable (along firing
schedules feasible from the initial state) characterizes the behaviour of the time Petri net
and determines in a rigorous way the actions of the system.

Using this set of states for analysis purposes, reasonable computational limits may be
exceeded for more complex systems. Therefore, finding exhaustive and computationally
feasible analysis methods for time Petri nets will be one of the main tasks for the future.
Nevertheless, the examples presented in figures 3 and 4 show that in many practical
situations the TPN approach enables an exact definition and verification of the intended
system’s activities.

Work carried out in this direction is in its early stage of development. Up to now the
vocabulary of the graphical language is restricted to issues relevant to the operational
behaviour of reactor safety systems. A graphical editor tool will be developed whereby
it will be possible to specify a system by generating functional diagrams and integrating
them into suitable net forms (on a graphical screen by selecting the different symbols
displayed in a separate window and connecting them by arcs).
For the purpose of analysis and simulation of the system specification the semantic
information about the constructed function and net forms is stored in the background.
Finally, the ultimate goal will be to automate code generation directly from the system
specification.

Abstract

The behavior during execution of an Oreste program is driven by
the application. To achieve reliability, this behavior has to be
always defined. The failure of the execution of an Oreste software
component is either explicitly recovered or implicitly
propagated to the caller of the component. This is performed by a
multi-tasking extension of programming by contract (organized
panic and/or resumption) proposed for the Eiffel sequential
language by B. Meyer.


1  Introduction 

This work is based upon the French contribution to the Programming Language for
Robots (PLR) [1] developed by the ISO working group 4 of TC 184 SC 2. From
this contribution, further work led to the definition of a general purpose reactive real
time language called Oreste for specifying distributed concurrency, i.e. the behavior
during execution is driven by the application; a program can be executed on a
single-processor computer or on a computer network without shared memory. It
requires a real time executive on every station and an inter-station communication
manager which provides point-to-point communication with on-receive
synchronization.

The main design goal of Oreste is security:

• the compiler should detect as many errors as possible; this is achieved
through explicit declaration of every program entity, and strong typing;
furthermore, the behavior of programs written with a subset of the language can be
expressed by a finite state automaton;

• always defined behavior during execution; this is based on a multi-tasking
extension of Eiffel's programming by contract, an on-line deadlock detection, and a
clean termination of concurrent execution.

An Oreste program is described by a hierarchical composition of software
components; Oreste defines the following software components: function,
procedure, task type and statements that include the usual ones and those dedicated to
multi-tasking: fork, accept, reply, request, wait and loopwait statements.

Every software component either succeeds or fails, and when it fails, the failure
is either recovered or propagated to its caller. This is performed by the
programming by contract proposed in the Eiffel language by B. Meyer [2], [3]. In
this paper, we first describe the exception mechanism and the programming
by contract adopted by Oreste, then Oreste's task and fork and the propagation of
a failure in a multi-tasking Oreste program, and in the last part the failure of the
communication statement.


2  Failure,  Exception  and  Contract 

2.1  Introduction 

Several languages introduce exception mechanisms for dealing with abnormal cases.
Most of them (as in Ada [4], CLU [5]) are not safe, i.e. a computation can fail
without propagating the failure to the caller. By contrast, B. Meyer has defined for the
Eiffel language [2,3] a clean and safe mechanism. In the following, we summarize
its main features.

2.2  Eiffel's  Exceptions  Mechanism 

2.2.1 Default Mechanism

Eiffel defines the following software components: statements and routines (i.e. procedures
or functions), and proposes the following definitions of exception and failure:

An exception is the occurrence of an abnormal condition during the
execution of a software component;

A failure is the inability of a software component to satisfy its purpose.

Every software component either succeeds or fails. Failure of the execution of a
statement of a routine implies the failure of the execution of the routine call statement,
i.e. the execution of the routine's sequence of statements is aborted, and control is
returned to the caller. Failure of a routine call statement is just a particular case of
statement execution failure: in this way, failure is implicitly propagated to the
caller, up to the failure of the top-level routine, which aborts execution. With this
default mechanism, the occurrence of an exception leads to the abortion of execution.

2.2.2 Rescue Mechanism

Eiffel  introduces  two  ways  for  providing  fault  tolerance : 

•  organized  panic ; 

•  resumption. 

Syntactically, fault tolerance is expressed at routine level by a rescue clause which
contains a sequence of statements. Eiffel allows a particular statement, the
retry statement, to be used only in rescue clauses. When the execution of a routine
statement fails, the rescue clause's sequence of statements is executed; if this execution
fails, the execution of the routine call statement fails; if it succeeds and the retry
statement is invoked, the routine is executed again from its first statement (resumption);
if it succeeds without invoking any retry statement, the execution of the routine call
statement fails (organized panic).

Note  that  routine  local  variables  are  always  set  to  their  default  values  during  routine 
invocation,  and  resumption  does  not  modify  any  variable :  so  re-execution  is  started 
with  current  values. 

Organized panic is useful just for restoring a clean state, for example to allow
resumption at a higher level. In either case, note that the rescue clause does not
perform the purpose of the routine.
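
The control flow of the rescue mechanism can be imitated in a language without rescue clauses. The following sketch is not Eiffel and not Oreste: the names `call`, `RoutineFailure` and `Retry` are hypothetical, a Python exception stands for a failure, and a rescue clause requests resumption by raising `Retry`.

```python
class RoutineFailure(Exception):
    """Failure of a routine call, propagated to the caller."""


class Retry(Exception):
    """Raised by a rescue clause to request resumption."""


def call(body, rescue=None):
    """Execute body; on failure run the rescue clause, if there is one."""
    while True:
        try:
            return body()
        except Exception as exc:
            if rescue is None:
                # default mechanism: the failure is propagated to the caller
                raise RoutineFailure("routine failed") from exc
            try:
                rescue(exc)
            except Retry:
                continue  # resumption: execute the routine again from its start
            except Exception as rescue_exc:
                raise RoutineFailure("rescue clause failed") from rescue_exc
            # the rescue clause succeeded without retry: organized panic
            raise RoutineFailure("organized panic") from exc


attempts = []


def read_sensor():
    """Hypothetical routine that fails on its first two executions."""
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("sensor not ready")
    return 42


def retry_always(exc):
    raise Retry()


def clean_up_only(exc):
    pass  # restore a clean state, then let the call fail
```

Here `call(read_sensor, retry_always)` succeeds on the third execution, whereas a routine rescued by `clean_up_only` still fails towards its caller. A rescue clause that always retries a routine that always fails would loop forever, so a real rescue clause has to bound its retries.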

2.2.3 Programming by Contract: Preconditions and Postconditions

An Eiffel routine can be completed by an optional precondition and an optional
postcondition.

A precondition expresses conditions that must be satisfied by the context and/or the
values of the effective routine call parameters before the execution of the first
statement of the routine. Syntactically, a precondition is defined by a set of boolean
expressions, which have to evaluate to true; if evaluation fails, or if one of them is
false, the routine is not executed, and the execution of the routine call statement fails.

A postcondition expresses conditions that must be satisfied by the context and/or the
values of local variables after the execution of the statement sequence of the routine
has succeeded. Syntactically, a postcondition is defined by a set of boolean expressions,
which have to evaluate to true; if evaluation fails, or if one of them is false, the
rescue clause is executed as defined before.

Preconditions and postconditions are the basis of B. Meyer's programming by
contract [2]. It is essential to express this contract precisely. A contract represents
the service that a software component has to perform, and the conditions necessary to
execute it.
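
As an illustration only, the checking order described above can be modelled in Python.
The helper names (call_with_contract, ContractFailure) are hypothetical and are not part
of Eiffel or Oreste; a check whose evaluation raises an exception is treated like a false
check, and a postcondition failure is simply reported here instead of triggering a
rescue clause.

```python
import math


class ContractFailure(Exception):
    """A precondition or postcondition did not hold."""


def _holds(check, *values):
    try:
        return bool(check(*values))
    except Exception:
        return False  # a check whose evaluation fails counts as not satisfied


def call_with_contract(routine, args, require=(), ensure=()):
    # Preconditions are evaluated on the arguments before the first statement.
    for check in require:
        if not _holds(check, *args):
            raise ContractFailure("precondition violated: routine not executed")
    result = routine(*args)
    # Postconditions are evaluated only after the routine body has succeeded.
    for check in ensure:
        if not _holds(check, result, *args):
            raise ContractFailure("postcondition violated")
    return result


def checked_isqrt(n):
    """Integer square root with an explicit contract."""
    return call_with_contract(
        math.isqrt,
        (n,),
        require=[lambda n: n >= 0],
        ensure=[lambda r, n: r * r <= n < (r + 1) ** 2],
    )
```

Calling checked_isqrt(10) returns 3, whereas checked_isqrt(-1) fails on the precondition
without executing the routine body.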


Oreste  has  adopted  Eiffel's  exception  mechanism  and  programming  by  contract ; 
however,  as  Eiffel  is  a  sequential  language,  the  exception  mechanism  has  to  be 
extended. 


3.  Concurrency  in  Oreste 

As in many other concurrent languages, concurrency in Oreste is expressed through
parallel execution of sequential tasks; we have adopted the Ada definition [4]; a
program without any task is executed sequentially on a single logical processor.
Tasks are entities whose executions proceed in parallel in the following sense: each
task has its own logical processor; different tasks (different logical processors)
proceed independently, except at points where they synchronize. However, there are
several differences between Ada and Oreste tasks:

•  Oreste  tasks  can  be  described  only  through  task  types ; 

•  an  Oreste  task  has  its  own  private  variables  (there  is  no  shared  data 
between  tasks) ; 

• task declaration, instantiation, and execution are handled by a new
dedicated statement, the fork statement.

3.1  Task  types 

In Oreste, we cannot directly describe a processing task but only a model of a task,
which we call a task type. Every task type can be instanced in order to obtain a
processing task. A task type is described by a set of private variables, a set of ports
and a set of private procedures and functions; ports define the potential for
communication/synchronization with other tasks; one particular procedure is the main
one; execution of an instance consists of executing the main procedure. An Oreste
application is a set of task types, and its execution consists of instantiating a
particular task type, the root task type, and executing it.

As execution of a task is sequential, Eiffel's exception mechanism applies: so a task
execution either succeeds or fails.

3.2  Basic  Fork  Statement 

3.2.1 Presentation

The only way to express concurrency in Oreste is to use the fork statement, which
declares a set of tasks, links between the ports of the tasks, and task scheduling
through the scheduling clause. The scope of declared tasks is limited to the current
fork statement, so a concurrent execution is completely embedded within a fork
statement.

In the following example (figure 1), the fork statement declares three tasks: the
tasks T1 and T2, instances of the task type typetask1 defined elsewhere, and the task
T3, an instance of the task type typetask2 defined elsewhere. These task types define
no port (i.e. there is no communication/synchronization between the tasks T1, T2 and
T3); the scheduling clause specifies parallel execution between the execution of T1
and the sequential execution of T2 and T3.


fork
    task T1, T2 : typetask1 ;
    task T3 : typetask2 ;
begin
    T1 || (T2 ; T3)
end fork ;

Figure 1: Example of Fork Statement

More generally, the scheduling clause is an expression whose operands are the
declared tasks and whose operators are the sequential and parallel operators.
Parentheses can be used to show grouping. Each declared task must appear exactly once
in the scheduling clause.

Sequential execution is expressed by the ";" operator, which is associative:

    A;(B;C) <=> (A;B);C <=> A;B;C

Parallel execution is expressed by the "||" operator, which is associative and
commutative:

    A||(B||C) <=> (A||B)||C <=> A||B||C
    A||B <=> B||A

The parallel operator has higher priority than the sequential operator:

    A||B;C <=> (A||B);C
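
The operator rules above can be checked with a small recursive-descent parser. The
following Python sketch is illustrative only and is not taken from an Oreste compiler:
the grammar (a clause is a ";"-separated list of terms, a term is a "||"-separated list
of factors, a factor is a task name or a parenthesized clause) and all function names are
assumptions. Nested nodes of the same kind are flattened to reflect associativity, and
the operands of "||" are sorted to reflect commutativity.

```python
import re

TOKEN = re.compile(r"\s*(\|\||;|\(|\)|[A-Za-z_]\w*)")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Return a canonical tree: a task name, ('seq', parts) or ('par', parts)."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected or 'a task name'}, got {token!r}")
        pos += 1
        return token

    def factor():
        if peek() == "(":
            take("(")
            node = clause()
            take(")")
            return node
        name = take()
        if name in ("||", ";", ")"):
            raise SyntaxError(f"expected a task name, got {name!r}")
        return name

    def chain(kind, operator, operand):
        parts = [operand()]
        while peek() == operator:
            take(operator)
            parts.append(operand())
        flat = []
        for part in parts:
            # Associativity: merge nested nodes of the same kind.
            if isinstance(part, tuple) and part[0] == kind:
                flat.extend(part[1])
            else:
                flat.append(part)
        if kind == "par":
            flat.sort(key=repr)  # commutativity: operand order is irrelevant
        return flat[0] if len(flat) == 1 else (kind, tuple(flat))

    def term():
        # "||" binds tighter than ";", so it is parsed at the inner level.
        return chain("par", "||", factor)

    def clause():
        return chain("seq", ";", term)

    tree = clause()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With this representation, two scheduling clauses are equivalent under the rules above
exactly when parse returns equal trees; for instance parse("A||B;C") and
parse("(A||B);C") are equal, while parse("A||(B;C)") differs.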

The execution of the basic fork statement consists of the following steps:

• instancing all declared tasks, including setting the task private variables
to their default values;

• concurrent execution, as specified by the scheduling clause;

• destructing all declared tasks.

3.2.2 Execution and Failure of Fork Scheduling Clause

Depending on the execution of the fork scheduling clause, a fork statement
execution either succeeds or fails.

Scheduling clause execution is described recursively by the sequential execution of
its terms and the parallel execution of its factors. Sequential execution of A;B,
where A and B are scheduling clause terms, is defined as follows:

• A is started first; when A has succeeded, B is started; when B succeeds,
the execution of A;B succeeds;

• if the execution of A fails, B is not executed, and the execution of A;B
fails;

• if the execution of B fails, the execution of A;B fails.

Parallel execution of A||B, where A and B are scheduling clause factors, is defined
as follows:

• A and B are started in an order that is not defined by the language; when
both have succeeded, the execution of A||B succeeds;

• if the execution of A fails, the execution of A||B fails when B
terminates its execution (by a success or by a failure);

• by symmetry, if the execution of B fails, the execution of A||B fails
when A terminates its execution.

Note that the failure of one task of a parallel expression does not directly influence
the other one, because "different tasks proceed independently, except at points where
they synchronize"; so the failure of a task affects the behavior of other tasks only
when they wish to synchronize with it.

The propagation of the failure of a task can be summarized as follows: failure of a
component of a sequence causes the failure of the sequence; failure of a component
of a parallel expression does not directly affect the other components; however, when
all other components have succeeded or failed, the parallel expression fails.

3.2.3 Example

We go back to the scheduling clause of figure 1:

    T1 || (T2 ; T3)

The failure of a task propagates to the failure of the fork scheduling clause as
follows:

• if the execution of the task T1 fails, the fork scheduling clause
fails when the execution of T3 terminates;

• if the execution of the task T2 fails, the fork scheduling clause
fails when the execution of the task T1 terminates. The task T3 is not
executed;

• if the execution of the task T3 fails, the fork scheduling clause
fails when the execution of T1 terminates.
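
The propagation rules and the three cases above can be reproduced with a short
simulation. This Python sketch is an assumption-laden model rather than Oreste
semantics in full: the tree encoding and the name run_clause are hypothetical, each task
is reduced to a fixed success/failure outcome, and parallel components are simply run
one after the other, which is sufficient because the tasks of figure 1 do not
synchronize.

```python
def run_clause(node, succeeds, executed):
    """Evaluate a scheduling clause; return True if its execution succeeds.

    node     -- a task name, or a tuple ("seq", ...) / ("par", ...) of sub-clauses
    succeeds -- mapping from task name to its outcome (True = success)
    executed -- list collecting the names of the tasks that were started
    """
    if isinstance(node, str):
        executed.append(node)
        return succeeds[node]
    kind, *parts = node
    if kind == "seq":
        for part in parts:
            if not run_clause(part, succeeds, executed):
                return False  # the remaining components are not executed
        return True
    if kind == "par":
        # Every component runs to completion, even if another one has failed.
        results = [run_clause(part, succeeds, executed) for part in parts]
        return all(results)
    raise ValueError(f"unknown node kind: {kind!r}")


FIGURE_1 = ("par", "T1", ("seq", "T2", "T3"))


def outcome(failing):
    """Run the clause of figure 1 with the given set of failing tasks."""
    executed = []
    tasks = ("T1", "T2", "T3")
    ok = run_clause(FIGURE_1, {t: t not in failing for t in tasks}, executed)
    return ok, executed
```

For example, outcome({"T2"}) reports a failed clause in which only T1 and T2 were
started, matching the second case above.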


3.3 Fork Statement with Post Conditions and/or Rescue Clauses

A fork statement can be completed with an optional post condition and an optional
rescue clause.


The  post  condition  is  evaluated  only  if  the  scheduling  clause  execution  has 
succeeded. 

The rescue clause is executed when the scheduling clause has failed, or when the post
condition has failed. Oreste does not define an equivalent of Eiffel's retry statement
(which acts as a goto), but uses a predefined boolean local variable, called DoRetry.
At the end of the execution of the rescue clause, if the boolean variable DoRetry is
true, the scheduling clause is executed again.

As for an Eiffel routine, execution of a rescue clause leads to organized panic or
resumption, which consists of executing the declared tasks of the fork again. Note
that, as the tasks are not destructed when the scheduling clause execution terminates,
they are not re-instanced, and the task private variables are not reset to their
default values.

Note that although a fork statement execution includes concurrent execution, post
condition evaluation and rescue clause execution are sequential.

Programming by contract and its multi-tasking extension provide both a
debugging mechanism and a tool for fault tolerance and failure recovery.

Figure 2 shows a fork statement with a post condition and a rescue clause; figure 3
illustrates the effect of retrying execution: T1 and T2 are first instanced, and then
started; both tasks succeed, but evaluation of the post condition fails, leading to
execution of the rescue clause, which sets the DoRetry variable to true; so T1 and T2
are re-executed; both tasks succeed, as does the post condition evaluation; both tasks
are then destructed: the fork statement execution has succeeded. Had the second
evaluation of the post condition failed, the second execution of the rescue clause
would have set the DoRetry variable to false, leading to the failure of the fork
statement execution.

FirstAttempt := true ;

fork
    task T1 : typetask1 ;
    task T2 : typetask2 ;
begin
    T1 || T2
ensure
    -- boolean expressions
rescue
    DoRetry := FirstAttempt ; FirstAttempt := false ;
end fork ;

-- statement b

Figure 2: Example of Fork Statement with post condition and rescue clause
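
The behaviour of the statement of figure 2 can be mimicked by the following Python
sketch. It is a simplified model under stated assumptions: run_fork, ForkFailure and
demo are hypothetical names, each task is reduced to a dictionary holding one private
counter, the scheduling clause is a plain function, and the rescue clause returns the
final value of DoRetry. The point illustrated is that tasks are instanced once, so their
private variables keep their values across a retry.

```python
class ForkFailure(Exception):
    """The fork statement failed (organized panic)."""


def run_fork(task_types, schedule, ensure, rescue):
    """Return the number of times the scheduling clause was executed."""
    # Tasks are instanced once; a retry reuses them with their current state.
    tasks = {name: make() for name, make in task_types.items()}
    runs = 0
    while True:
        runs += 1
        # The post condition is evaluated only if the scheduling clause succeeded.
        if schedule(tasks) and ensure(tasks):
            return runs  # success; the tasks are destructed on return
        if not rescue():  # rescue() yields the final value of DoRetry
            raise ForkFailure(f"failed after {runs} run(s)")


def demo(required_runs):
    state = {"first_attempt": True}  # FirstAttempt := true

    def schedule(tasks):  # stands in for T1 || T2
        for task in tasks.values():
            task["runs"] += 1
        return True  # both tasks succeed

    def ensure(tasks):  # a post condition that depends on the private counters
        return all(task["runs"] >= required_runs for task in tasks.values())

    def rescue():
        do_retry = state["first_attempt"]  # DoRetry := FirstAttempt
        state["first_attempt"] = False     # FirstAttempt := false
        return do_retry

    task_types = {"T1": lambda: {"runs": 0}, "T2": lambda: {"runs": 0}}
    return run_fork(task_types, schedule, ensure, rescue)
```

Here demo(2) succeeds on the second execution of the scheduling clause, because the
counters were not reset by the retry, while demo(3) fails after the second execution,
since the rescue clause then leaves DoRetry false.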


4.  Failure  of  Communication  Statements 

Synchronization and communication in Oreste come from the concept of "port"
introduced by Silberschatz [6]. It achieves synchronous message passing, without
explicit naming of the correspondent. Every port has one and only one owner, and
one or more users. An Oreste port provides [7]:

• for users, synchronous sending-and-receiving by the request statement;

• for the owner, synchronous receiving by the accept statement and
asynchronous sending by the reply statement; every accept statement execution
must be balanced with a reply statement execution; one or more accept statements
can appear in wait and loop wait statements, corresponding respectively to the
alternative and repetitive commands of CSP [8, 9].

A task declares all the ports it owns and all the ports it uses; these declarations
introduce local names and specify the type of the transmitted information. In a fork
statement, link clauses define the links between the different ports of the tasks.

Note that using local names allows the writing of general task types: by contrast,
an Ada entry call either explicitly names the corresponding task, or relies on implicit
naming resolved through visibility rules. Furthermore, as all potential
interactions are declared in the fork statement, all potential correspondents are
always known.

Like other statements, request, accept and reply statements either succeed or fail.

If all users of a port are terminated when the owner invokes an accept or a reply
statement, the communication statement fails. Also, if the owner of a port is
terminated when a user invokes a request statement, the request statement fails.

As for other statements, the failure of a communication statement causes the
rescue clause of the current procedure to be executed. This ensures that a task never
waits for a communication when all its potential correspondents are terminated, and
allows a clean and automatic termination of tasks.
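
The two failure rules for communication statements can be written down as a small
check. The Python sketch below is an illustration under assumptions: the class Port,
its methods and CommunicationFailure are hypothetical names, and only the decision
"does the statement fail immediately?" is modelled, not the rendezvous itself or the
blocking of tasks.

```python
class CommunicationFailure(Exception):
    """A request, accept or reply statement failed."""


class Port:
    def __init__(self, owner, users):
        self.owner = owner
        self.users = frozenset(users)
        self.terminated = set()

    def terminate(self, task):
        """Record that a task linked to this port has terminated."""
        self.terminated.add(task)

    def request(self, user):
        """Check performed when a user invokes a request statement."""
        if self.owner in self.terminated:
            raise CommunicationFailure(f"{user}: owner {self.owner} is terminated")

    def accept_or_reply(self):
        """Check performed when the owner invokes an accept or reply statement."""
        if self.users <= self.terminated:
            raise CommunicationFailure(f"{self.owner}: all users are terminated")
```

Because every link is declared in the fork statement, the sets of owners and users are
known in advance, which is what makes such a check possible in this model.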


5. Conclusions and Perspectives


The multi-tasking extension of programming by contract, organized panic and
resumption provide both a debugging mechanism and a mechanism for fault
tolerance and failure recovery for real-time application programs. Failures are
recovered or propagated at run time, which gives an Oreste program an always defined
behavior during its execution. Also, the communication lock that arises when all the
potential correspondents are terminated can be avoided by the failure of the
communication statement. However, the problem remains in the case of deadlock
between several tasks. The behavior of programs written in a subset of the language
can be expressed by a finite state automaton, so that off-line deadlock detection can
be performed. For the full language, work is currently in progress on providing an
on-line deadlock detection mechanism.


Abstract

The  history  of  attempts  to  secure  computer  systems 
against  threats  to  confidentiality,  integrity,  and 
availability  of  data  is  briefly  surveyed,  and  the 
danger  of  repeating  a  portion  of  that  history  is  noted. 
Areas  needing  research  attention  are  highlighted,  and 
a  new  approach  to  developing  certified  systems  is 
described. 


1  Introduction 

Concerns  about  the  security  of  data  processed  by  or  stored  in  computers  are 
probably  as  old  as  computing,  at  least  in  the  sense  that  some  of  the  earliest 
modern computing machines were built and used in sensitive applications -
for example, the Polish "Bombe" and its British descendants that were
used  to  attack  German  ciphers  during  World  War  II  [1].  But  it  was  only 
with  the  development  of  large-scale,  shared  multiprocessing  systems  that 
computer  security,  in  the  sense  that  term  is  used  today,  began  to  be  an  issue 
of  general  concern.  The  advent  of  time-sharing  systems  in  the  late  1960's 
and  early  1970's  brought  the  difficulties  of  protecting  users  from  each  other 
within  a  single  computing  environment  into  sharper  focus,  because  people 
expected  to  store  data  for  long  periods  in  such  systems,  as  well  as  to  receive 
a  fair  share  of  interactive  computing  services.  This  paper  will  review 
briefly  some  of  the  history  of  computer  security  work  starting  from  that 
time,  summarize  some  of  the  lessons  we  have  learned,  sketch  a  recently 
developed  approach  to  developing  and  certifying  computer  systems  with 
security requirements, and suggest some research directions.

Because words like "security" and "trust" are commonly, but imprecisely,
used,  we  introduce  a  few  definitions  before  proceeding.  We  say  a  computer 
system  is  secure  if  it  can  preserve  the  confidentiality,  integrity,  and 
availability  of  the  data  it  processes  and  stores  against  some  anticipated 
set  of  threats.  Preserving  confidentiality  has  typically  denoted  protecting 
the  data  against  unauthorized  disclosure;  preserving  integrity  has 
denoted  preventing  its  unauthorized  modification;  and  preserving 
availability  has  denoted  preventing  its  unauthorized  withholding.  These 
definitions  are  perhaps  narrower  than  the  casual  reader  might  expect,  and 
indeed  they  have  provoked  some  debate  even  within  the  computer  security 
community. Integrity, in particular, continues to be a much-debated term
[2]. A computer system or component is trusted if we rely on it to perform
some  critical  function  or  preserve  some  critical  property  (such  as  security); 
it  is  only  trustworthy  if  we  have  evidence  to  justify  the  trust  we  place  in  it. 
A  computer  system  is  called  multilevel  secure  (MLS)  if  it  is  trusted  to 
separate  users  with  different  clearances  from  data  with  different 
classifications. 


2  Penetrate  and  Patch 

When,  in  the  late  1960's  and  early  1970's,  operating  system  developers 
(and  their  customers)  began  to  discover  that  their  operating  systems  were 
somewhat  less  secure  than  they  had  thought,  they  treated  security  flaws 
like  any  other  bugs,  and  installed  fixes.  Customers  who  were  particularly 
interested  in  the  security  of  their  systems  sometimes  hired  "tiger  teams"  to 
try  to  penetrate  them,  so  that  all  holes  might  be  found  and  patched.  One 
product of such efforts was the flaw hypothesis methodology, which
suggested an informal but systematic approach to this activity [3].

This  approach  to  computer  security  was  a  victim  of  its  own  success  -  not 
its  success  in  achieving  secure  computer  systems,  but  its  success  in 
penetrating  insecure  ones.  Every  time  a  new  person  or  group  attempted  to 
penetrate  a  system,  even  one  that  had  previously  been  penetrated  and 
patched,  new  holes  were  found  [4].  Although  records  were  sometimes 
collected  concerning  the  holes  found,  they  were  not  widely  circulated,  and 
many of them have since been lost [5]. A forthcoming research report
collects  fifty  surviving  examples  and  proposes  a  taxonomy  for  organizing 
this  kind  of  data  [6].  Figure  1  reproduces  a  chart  from  that  report 
characterizing  the  genesis,  location,  and  time  in  the  system  life  cycle 
where  these  flaws  were  introduced.  The  taxonomy  and  charts  in  that 
report  are  intended  to  provide  a  helpful  method  for  abstracting  current 
flaw  data  so  that  future  attempts  to  improve  system  security  can  build  on  a 
stronger  empirical  base. 


Figure 1: Characteristics of computer security flaw examples (from Ref. 6).


3 Technology for Trustworthy Operating Systems


The  discouraging  results  of  penetrate  and  patch  activities  led  to  the 
realization  that  a  more  systematic  approach  to  building  secure  systems 
was  needed.  From  the  operating  system  notion  of  monitors  as  software 
components  that  controlled  access  to  critical  system  resources,  computer 
security  researchers  developed  the  concept  of  a  reference  monitor  that 
would  validate  that  all  references  issued  by  subjects  (executing  programs) 
to  objects  (memory  and  files)  were  consistent  with  an  access  control  policy 
[7].  If  the  hardware  and  software  needed  to  perform  the  reference  monitor 
functions  could  be  isolated  and  encapsulated  in  a  small  and  simple  enough 
part  of  a  system  so  that  high  confidence  in  its  correctness  could  be 
established,  that  component  would  be  called  a  security  kernel  [8].  For  some 
time,  researchers  labored  to  demonstrate  prototype  security  kernel 
implementations.  The  difficulty  of  actually  isolating  all  of  the  code  on 
which  security-enforcement  depended  proved  greater  than  had  been 
supposed,  but  it  nevertheless  seemed  clear  that  much  more  trustworthy 
systems  could  be  built  using  this  approach  than  by  the  penetrate-and- 
patch  method. 

To persuade vendors to build more trustworthy systems and to make it
easier for users to purchase such systems, the U.S. government established
a National Computer Security Evaluation Center in 1981. It was to produce
a set of computer security evaluation criteria and then evaluate products
submitted voluntarily by vendors. Results would be maintained on an
Evaluated Products List that could be used to qualify products for
government purchase; this qualification provided the incentive for
vendors to submit their products for evaluation.

The Trusted Computer System Evaluation Criteria (TCSEC), better known as "the
Orange Book," which defined seven different evaluation levels, appeared
officially in 1983. It defined a trusted computing base (TCB) as the "totality of
protection mechanisms within a computer system ... that is responsible for
enforcing a security policy." Thus a system that had its security
enforcement mechanisms distributed throughout its software and
hardware could be said to have a trusted computing base, but not a security
kernel. Only the highest two levels defined by the TCSEC (B3 and A1)
require that the TCB be structured to exclude code not essential to security
policy enforcement - that is, to have a security kernel.

The  Orange  Book  was  tacitly  based  on  an  abstraction  of  the  dominant 
computing  model  of  the  1970's:  a  shared,  central-server  timesharing 
system.  As  the  computing  world  shifted  toward  workstations  and 
networks, it became clear that, at the very least, some additional thought
was required to see how to apply the Orange Book in this context. One
result of this re-thinking was the Trusted Network Interpretation (TNI) of the
TCSEC (the "Red Book"), published in 1987.

It also became clear that having an operating system that could separate
differently classified information from differently cleared users was not
the same thing as having an application that provided a useful MLS
interface. The operating system might support a variety of applications,
each operating at a single security level, or it might support several
instances of the same application, with each instance operating at a
different (single) level. But to support a single application that operated
across a range of security levels required that some of the security
enforcement be provided by the application - the product TCB would have
to be modified or extended to incorporate parts of the application in this
case. To address this issue, particularly in the context of database
management systems, the Trusted Database Interpretation was published
in 1991.

The rest of the world did not sit still during this period. Toward the end
of the 1980's, France, Germany, the Netherlands, and the United Kingdom
produced the Information Technology Security Evaluation Criteria (the
ITSEC) - a "harmonised" version of evaluation criteria created jointly by
representatives from all four countries. The ITSEC permit greater
flexibility than the TCSEC, in that vendors and customers can separately
specify functional requirements and assurance requirements, which are
joined in the Orange Book classes. The ITSEC also attempt to define a
structure that supports both product and system evaluation, although their
utility for the latter role is not universally accepted. The Canadian
Trusted Computer Product Evaluation Criteria (CTCPEC) also permit separating
function and assurance, but (as their title indicates) are restricted to product
evaluations. However, the CTCPEC have gone farthest in explicitly
addressing integrity and denial-of-service (availability) issues.


4  What  We  Can  Do  Today 

Many  products  have  been  built  and  submitted  for  evaluation  in  the  decade 
since  the  Orange  Book  first  appeared.  As  of  June  1992,  the  U.S.  Evaluated 
Products List included over a dozen products rated C2, four rated B1, two
rated B2, and one at B3; in addition there are two systems that have been
evaluated as network components under the TNI; one received a B2 rating
and the other an A1. Products have also been evaluated successfully
against the ITSEC and the CTCPEC.

On this basis, it seems fair to conclude that today we can specify, design,
and  build  operating  system  products  to  meet  the  requirements  reflected  by 
Orange  Book  levels  D  through  B3.  We  also  know  how  to  evaluate  products 
against  those  criteria,  and  there  is  some  evidence  that  products  satisfying 
the  higher  levels  of  the  criteria  are  indeed  more  difficult  to  penetrate 
than  those  that  don't. 

These  conclusions  should  be  tempered  with  the  understanding  that  the 
evaluation  process  is  still  typically  long,  arduous,  and,  particularly  at  the 
higher  assurance  levels,  expensive.  This  fact  is  partly  attributable  to  the 
way  that  the  evaluation  process  has  evolved,  as  Steve  Lipner  noted  in  his 
insightful  paper  presented  two  years  ago  [9],  but  it  is  also  attributable  to 
the  fact  that  developing  high  assurance  systems  based  on  security  kernels 
requires  greater  control  and  documentation  of  the  system  engineering 
process,  and  particularly  the  software  engineering  process,  than  most 
developers  customarily  provide. 

Something  else  that  we  can  (and  do)  do  today  is  plug  together  systems 
from  products,  including  workstations,  local  area  networks,  routers, 
gateways,  and  software  from  a  wide  variety  of  sources.  Often  these 
systems  perform  their  functions  quite  effectively,  but  the  security  they 
provide  is  usually  hard  to  determine  and  hard  to  control,  even  if  the 
security  properties  of  the  individual  components  are  known. 

5  Problems  We  Face 

There  are  many  problems  that  need  to  be  solved  before  we  will  see  the 
widespread  and  effective  application  of  trustworthy  computing 
technology.  A  few  of  these  problems  are  highlighted  in  this  section;  the 
following  section  reports  some  recent  advances  in  addressing  the  first  of 
them. 

We need less costly product evaluation/system certification techniques
that can be applied more quickly, but with effectiveness at least equal to
current approaches. Problems with the current product evaluation process
in the U.S., as described by Lipner [9], have stimulated attempts to
improve it. Unquestionably, better methods are needed, but there seems to
be an increasing tendency in some quarters to accept commercial off-the-shelf
products, together with some form of testing, as sufficient to assure
the security of a product. We must be sure that this tendency does not
simply lead us back to the discredited "penetrate-and-patch" paradigm.
Research areas relevant to this problem include techniques for assessing
software development methods, techniques for documenting and assessing
software specifications and designs, tools and methods for product testing,
and reverse engineering methods.


We need better ways to understand the security provided by composite
and  distributed  systems.  The  only  reason  that  we  can  build  systems  today 
by  plugging  them  together  is  that  there  is  a  degree  of  standardization  at 
the level of physical connectors and device protocols. These standards
permit  the  possibility  of  component-based  system  engineering.  We  are  not 
likely  to  be  able  to  deduce  much  about  the  security  provided  by  systems 
built  this  way,  however,  until  at  least  a  comparable  level  of 
standardization  of  the  security  functions  and  assurances  provided  by 
components  is  achieved.  Further,  we  must  take  into  account  the  potentially 
world-wide distribution of modern systems, which, particularly in the
commercial  sphere,  often  means  that  reliable  authentication  of  the 
originator  and  recipient  of  a  message,  rather  than  the  confidentiality  of 
its  contents,  is  the  paramount  security  concern.  Research  areas  that 
address  this  issue  encompass  abstract,  formal  work  on  security  modeling 
techniques  that  support  composition  and  decomposition,  methods  and  tools 
for  reasoning  about  and  finding  flaws  in  cryptographic  protocols,  and 
concrete  approaches  for  standardizing  security  function  and  assurance 
requirements. 

We need better ways to control the security functions that current (and
future)  systems  provide.  Many  actual  security  problems  occur  not  because 
security  controls  are  lacking  but  because  the  existing  controls  have  not  been 
set  up  correctly.  Steve  Kent  of  BBN  Communications  has  observed  that  the 
US  is  a  country  of  people  who  are  unable  to  program  their  videotape 
recorders,  yet  we  are  building  interfaces  to  our  security  controls  that  are 
much  more  complex  than  the  average  VCR  control  panel.  It  is  difficult  to 
find  much  research  aimed  at  this  problem  presently,  but  work  to  identify 
common  requirements  for  application-based  security  controls  and  to 
develop  user  and  administrator  interfaces  to  them  that  are  based  on  work 
in  the  area  of  human-computer  interaction  could  lead  to  significantly 
improved  security  in  practice. 

We  need  to  develop  practical  methods  for  building  high  assurance 
systems.  There  are  definite  needs  for  systems  that  can  provide  very  high 
confidence  that  they  will  not  have  security  failures.  The  leading 
technology for developing high assurance software is to apply formal
techniques to its specification and development. Although a recent study
shows  increased  industrial  application  of  formal  methods  [10],  their  use  is 
still  seen  as  a  significant  cost  factor,  and  there  is  uncertainty  as  to  whether 
they  can  be  successfully  applied  in  large  projects.  Further,  it  is  difficult  to 
assess  the  cost-effectiveness  of  their  application  because  it  is  hard  to 
quantify  the  security  provided  by  the  resulting  system.  Imaginative 
approaches  are  needed  to  organize  systems  so  that  requirements  for  high 
assurance  software  are  kept  to  a  minimum.  We  need  practical 
methodologies for exploiting formal methods on those portions of systems
that unavoidably require high assurance, and we need methods to estimate
or measure the security actually provided.

We need to broaden the scope of "security" and to develop methods for
addressing security properties in conjunction with other critical system
properties. Few systems are purchased strictly to provide security.
Typically, a customer requires a system to perform some function -
communication, record keeping, real-time control, etc. - and may
acknowledge  that  to  perform  the  function  properly,  some  security 
requirements  must  be  met  as  well.  In  commercial  applications, 
confidentiality  may  frequently  take  a  back  seat  to  integrity  and 
authenticity,  and  availability  may  be  the  strongest  security  concern.  In 
control  systems,  timely  delivery  of  results  may  be  paramount.  If  we  are  to 
build  systems  that  incorporate  security  as  well  as  the  other  properties 
users  require,  we  need  techniques  for  developing  designs  that  can  meet  a 
variety  of  critical  requirements  and  that  permit  a  system  designer  to  make 
rational  trade-offs  among  them.  Research  that  permits  quantification  of 
covert channel bandwidths, for example, is a step in this direction to the
extent that it permits us to quantify the rate at which a particular system
design permits information to leak [11]. Work to model denial of service
protection  is  similarly  relevant  [12]. 

6  Developing and Certifying Trustworthy Systems: A New Approach

As  noted  above,  we  cannot  at  present  develop  and  certify  the  security 
properties  of  integrated  systems  nearly  as  well  as  monolithic  products.  In 
this  section,  we  briefly  describe  an  informal,  but  structured,  approach  to 
system development and certification developed recently at the Naval
Research  Laboratory.  This  approach  has  yet  to  be  applied  in  sufficient 
detail  to  a  large  example  to  permit  us  to  make  strong  claims  about  its 
effectiveness,  but  it  is  based  on  concepts  proven  in  our  earlier  work  [13,14]. 
It  has  strong  intuitive  appeal,  both  as  a  way  to  address  security 
requirements  during  system  development  and  as  a  way  to  explain  to  the 
accreditor  (the  person  responsible  for  deciding  whether  to  permit  the 
system to be operated) what security the system provides and what risks
its  operation  would  pose.  Certification  denotes  a  technical  assessment  of 
the  ability  of  a  system  to  meet  specified  technical  standards  (e.g.  for 
enforcing  security  requirements).  A  more  comprehensive  description  of  this 
approach  has  recently  appeared  [15]. 

The  approach  is  based  on  recording  assertions  and  assumptions  that 
capture  the  system  security  requirements  within  the  framework  of  a 
documented  assurance  strategy.  At  the  beginning  of  the  project  this 
strategy  records  both  an  initial,  high-level,  abstract  version  of  the 
assurance argument for the system as well as the plan for creating the
final,  more  detailed  and  concrete  assurance  argument  that  will  form  the 
primary  technical  basis  for  the  certification  decision.  As  the  project 
progresses,  the  assurance  strategy  is  elaborated  to  reveal  the  increasingly 
detailed  outline  of  the  assurance  argument,  which  demonstrates  that  the 
system  as  designed  and  built  actually  satisfies  its  security  requirements. 
This  argument  will  not  exist  as  a  separate  document;  the  final  assurance 
strategy  will  in  effect  be  an  index  to  other  parts  of  system  documentation 
(software  specification  and  design  documents,  test  plans  and  results,  etc.) 
that provide the "nuts and bolts" of the assurance argument. With this
approach,  certification  of  a  trusted  system  can  largely  be  accomplished  as 
an  audit  of  the  development  process. 


For  a  given  system,  assertions  are  predicates  that  are  enforced  by  the 
system,  and  assumptions  are  predicates  that  must  be  enforced  in  the 
system's  environment.  The  system  itself  is  unable  to  enforce  its 
assumptions, but must rely upon them. Together, assumptions and assertions
represent  what  must  be  true  of  the  system  and  its  environment  to  satisfy 
the  security  policy.  If  an  assumption  or  an  assertion  is  false,  a  security 
violation  may  occur. 

For example, consider a medical information system used by physicians,
nurses, and pharmacists within a single hospital to record the current
symptoms, diagnosis, treatment plan, and billing information for each
patient. Suppose that the system's security policy requires that (1) only an
administrator  can  create  a  new  patient  record  or  enter  authorizations  for 
doctors,  nurses,  and  pharmacists  to  use  the  system,  (2)  only  physicians  may 
update  the  recorded  diagnosis  and  treatment  plan,  (3)  nurses  may  update 
the  record  of  symptoms  and  medication  administered,  and  (4)  pharmacists 
may  read  the  treatment  plan  but  can  only  update  the  billing  information. 
Finally,  (5)  patients  are  prohibited  from  any  access  to  the  system. 
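
The five policy rules of this example can be written down as a small permission table. The following is a minimal sketch, not part of the approach as published: the role, resource, and action names are hypothetical labels chosen for illustration, and only the permissions stated explicitly in the text are encoded (anything not listed is denied).

```python
# Hypothetical encoding of policy rules (1)-(5); names are illustrative only.
POLICY = {
    "administrator": {("patient_record", "create"), ("authorization", "create")},
    "physician": {("diagnosis", "update"), ("treatment_plan", "update")},
    "nurse": {("symptoms", "update"), ("medication_administered", "update")},
    "pharmacist": {("treatment_plan", "read"), ("billing", "update")},
    "patient": set(),  # rule (5): patients have no access at all
}


def is_permitted(role: str, resource: str, action: str) -> bool:
    """Return True only if the policy table explicitly grants the action."""
    return (resource, action) in POLICY.get(role, set())
```

Whether each row of such a table is enforced by software (an assertion) or by trained users (an assumption) is exactly the architectural choice discussed next.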


The  architect  of  such  a  system  has  a  number  of  security  disciplines 
available  to  help  satisfy  system  security  requirements,  including  personnel 
security,  physical  security,  procedural  security,  communications  security, 
computer  security,  and  others.  The  system  architect  typically  seeks  the 
most  cost-effective  combination  of  methods  drawn  from  these  disciplines 
that  will  satisfy  the  overall  system  security  policy  in  the  face  of 
anticipated  threats. 

If  the  system  architect  decided  to  rely  primarily  on  the  discipline  of 
computer  security  to  enforce  the  security  policy,  most  of  the  predicates 
would be enforced by the medical information system software, and they
would be assertions about that software system. If the architect chose to
rely on personnel and procedural security measures (e.g., by training the
users in their roles and relying on them to invoke only the system functions
appropriate to those roles) then, from the standpoint of the information
system,  all  of  the  predicates  would  be  assumptions  about  the  environment 
in  which  it  operates. 


The assurance strategy provides a framework for recording the assertions
and assumptions according to a chosen system security architecture.
Suppose a design is developed that requires the administrator to use, and
to write down on paper, a password for authentication purposes. Figure 2
illustrates part of an initial assurance strategy for the medical
information system example. Notice that each assumption is supported by
an assertion from another discipline or else maps to the "vulnerabilities"
symbol. In this representation, assertions do not map to other parts of the
framework; they form a set of requirements to be satisfied by the
discipline in which they occur. The full assurance strategy for
COMPUSEC, for example, would have to explain what kind of assurance
would be provided that the COMPUSEC assertions are enforced (e.g.,
through use of a trusted computing base supporting access controls).

COMPUSEC
  Assertions
    1. Only administrator can authorize physician.
    2. Only physician can update diagnosis and treatment plan.
  Assumptions
    1. Administrator assigns IDs and roles for physicians, staff, etc. properly.
    2. Administrator keeps password in safe.

PERSONNEL SEC.
  Assertion
    1. Administrator receives proper training in system use.
  Assumption
    1. Administrator doesn't make mistakes.

PHYSICAL SEC.
  Assertion
    1. Administrator has safe.
  Assumptions
    1. Administrator keeps safe locked.
    2. Safe meets specs for tamper-resistance.

Figure 2.  Part of an assertion strategy.
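
The bookkeeping behind such a strategy can be sketched as a pair of tables: one for assertions per discipline, one mapping each assumption either to a supporting assertion or to a vulnerability. The sketch below is illustrative only. The assertion and assumption texts come from the Figure 2 fragment, but the support links are assumptions made here, since the arrows of the figure are not available in the text, and the function names are hypothetical.

```python
# Marker for an assumption that no discipline supports.
VULNERABILITY = None

ASSERTIONS = {
    "COMPUSEC": [
        "Only administrator can authorize physician",
        "Only physician can update diagnosis and treatment plan",
    ],
    "PERSONNEL SEC": ["Administrator receives proper training in system use"],
    "PHYSICAL SEC": ["Administrator has safe"],
}

# (discipline, assumption) -> (supporting discipline, supporting assertion).
# These links are guesses for illustration, not taken from the figure.
ASSUMPTIONS = {
    ("COMPUSEC", "Administrator assigns IDs and roles properly"):
        ("PERSONNEL SEC", "Administrator receives proper training in system use"),
    ("COMPUSEC", "Administrator keeps password in safe"):
        ("PHYSICAL SEC", "Administrator has safe"),
    ("PERSONNEL SEC", "Administrator doesn't make mistakes"): VULNERABILITY,
    ("PHYSICAL SEC", "Administrator keeps safe locked"): VULNERABILITY,
    ("PHYSICAL SEC", "Safe meets specs for tamper-resistance"): VULNERABILITY,
}


def dangling_supports(assertions, assumptions):
    """Assumptions whose claimed support is not a recorded assertion."""
    return [key for key, support in assumptions.items()
            if support is not VULNERABILITY
            and support[1] not in assertions.get(support[0], [])]


def vulnerabilities(assumptions):
    """Assumptions that no discipline supports (residual risk)."""
    return [key for key, support in assumptions.items()
            if support is VULNERABILITY]
```

A certifier reviewing the strategy would want `dangling_supports` to return an empty list, and would present the output of `vulnerabilities` to the accreditor as the residual risks.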

A  certifier  can  review  the  assurance  strategy  at  the  beginning  of  the 
development  and  decide  at  that  time  whether  this  strategy  (if  followed) 
is  likely  to  produce  an  acceptable  assurance  argument  at  delivery.  The 
assurance  strategy  may  be  modified  during  the  development,  since  as 
design  tradeoffs  are  made  it  may  be  appropriate  to  modify  the  kind  or 
degree  of  assurance  (e.g.,  code  reviews,  formal  verification,  testing) 
required  to  support  particular  parts  of  the  assurance  argument  that  is  being 
created.  It  allows  certifiers  to  assess  the  role  various  design  decisions 
play  in  the  overall  assurance  argument  and  to  determine  whether  the 
proposed  assurance  techniques  are  effective  for  demonstrating  the  validity 
of  the  decision. 

This  approach  is  not  a  panacea;  correctly  defining  the  security  policy 
and  creating  a  complete  assurance  argument  will  continue  to  be  a 
challenging  task.  But  it  does  promote  the  integration  of  system  and 
security  engineering,  it  can  reduce  the  risk  that  unforeseen  certification 
issues  will  impede  system  development  and  delivery,  and  it  can  make  the 
risks of operating the system more clearly visible to the accreditor.

7  Summary  and  Conclusion 

We  have  reviewed  briefly  the  development  of  trustworthy  computing 
technology.  Readers  should  particularly  note  the  lessons  of  the  "penetrate 
and  patch"  approach,  so  that  we  do  not  repeat  the  experiences  of  that  era. 
We have noted needs for improvements in system certification methods, in
understanding  security  implications  of  composite  and  distributed  systems, 
in  the  control  interfaces  for  security  functions,  in  methods  for  developing 
high  assurance  systems,  and  in  integrating  security  and  other  critical 
properties.  Finally,  we  have  glimpsed  a  new  approach  to  system 
certification. 

How  far  can  we  trust  computers?  We  already  trust  them  to  fly  our 
airplanes  and  rockets,  and  it  is  certainly  easier  to  purchase  a  computer 
system  and  know  its  security  properties  now  than  it  was  ten  years  ago.  But 
we still have far to go before we can make rigorous statements about the
trustworthiness  of  the  software  in  our  trusted  systems. 

Abstract

We  propose  a  security  audit  trail  analysis  approach  based 
on  predefined  attack  scenarios  and  using  genetic 
algorithms. This paper shows the validity of this approach
and  presents  some  of  its  problems. 


1.  Introduction 

Several approaches exist in computer security: (i) enforcing users' respect
for particular rules when they are using the system, (ii) identifying threats
to the system (COPS [1] is the best known tool in the UNIX world for
such an identification) and (iii) recording some or all actions performed
on the system in order to analyze the trail to detect some attacks (it is an
after-the-event detective control).

The third approach is known as "security audit trail analysis" (SATA)
and seems especially important because [2] (i) even the most secure
systems are vulnerable to misuse by legal users (audit trails may be the only
means of detecting authorized but abusive user activity), (ii) existing
systems have security flaws and thus are vulnerable to attacks, (iii)
substitution of existing systems by secure ones is not easy to achieve for
economic reasons and (iv) development, installation and management of
secure  systems  is  not  an  easy  task. 

So  we  need  to  record  events  in  audit  trails.  However  there  are  some 
problems:  (i)  this  approach  is  very  expensive  in  terms  of  disk  space,  (ii) 
its impact on the system's performance is noticeable and (iii) its efficiency
is low because the security officer (SO) has to manage such a huge
amount of recorded data that it is not humanly possible.

Our objective is to design an automatic tool to increase the SATA
efficiency. This paper is organized as follows.
Section 2 presents two classical approaches for intrusion detection.
Section 3 introduces an alternative approach based on attack scenarios.
In section 4 we give our own vision of the security audit trail analysis
problem which is proved to be NP-complete. Section 5 proposes a brief
survey  of  genetic  algorithms  (GA).  In  section  6  we  derive  a  simplified  but 
still  NP-complete  version  of  SATA  and  show  how  to  apply  GAs  to  it. 
Section  7  discusses  our  experiments  which  exhibit  fairly  good  results. 
Section  8  highlights  some  remaining  problems  of  our  approach  and 
section  9  concludes  the  paper. 


2.  Two Classical Approaches

Intruders  can  attack  a  system  using  either  unknown  or  known  techniques. 
In  the  first  case,  a  possible  strategy  for  the  SO  to  detect  intrusion  is  the 
use of a statistical approach (reference [3] uses neural networks). The SO
has to respond to the question "is the user's behavior normal according to
the past?". This approach is known as "the behavioral model" [2].
In the second case, it is possible to provide attack detection rules which
enable the SO to use expert systems tools. The question is "does the user's
behavior correspond to a known attack?". Some tools, and especially the
most  famous,  IDES  (Intrusion  Detection  Expert  System)  [4]  [5], 
implement  both  approaches.  Others  rely  on  expert  systems  exclusively. 
Some  tools  already  provide  quasi-real-time  intrusion  detection. 

The  statistical  approach  leads  to  some  problems:  (i)  the  choice  of  the 
parameters of the statistical model is tricky, (ii) the statistical model leads
to a flow of alarms in the case of a noticeable system environment
modification  and  (iii)  a  user  can  slowly  change  his  behavior  in  order  to 
cheat  the  system. 

In the expert system approach, the SO's knowledge is encoded in a set
of rules used to analyze the audit log. In practice, however, the SO has
only gained limited expertise, essentially because the huge amount of
recorded data makes the task intractable.


3.  An  Alternative  Approach 

A  third  approach,  "model-based  reasoning",  is  recommended  as  an 
additional  intrusion  detection  system  by  Teresa  Lunt  and  Thomas  Garvey 
[6].  It  consists  of  designing  attack  scenarios  as  sequences  of  user 
behavior.  These  behavior  sequences  are  then  translated  (depending  on 
the audit system used) into sequences of audit events (for example the
password copy activity is translated into an execute access to /bin/cp
and a read access to /etc/passwd). Using attack scenarios has some
advantages:

1.  the  SO  can  design  the  attack  scenarios  himself  according  to  the 
threats  he  is  afraid  of, 

2.  the  modification  of  an  existing  scenario  or  the  addition  of  a  new 
scenario  (e.g.  after  its  detection  by  a  statistical  method)  is  easy, 

3.  the  events  to  be  stored  are  only  those  present  in  at  least  one  scenario, 

4.  the  SO  can  attribute  to  each  scenario  a  weight,  according  to  the 
consequences  of  the  corresponding  attack. 

We  use  this  approach  for  our  work.  Let  us  consider  the  attack 
scenarios  as  sequences  of  audit  events.  A  method  to  simplify  the  design  of 
the  scenarios  must  be  found.  We  propose  the  following  requirements  for 
this  method: 

1.  it should spare the designer the trouble of enumerating all the possible
variants  of  the  same  scenario 

2.  it  should  allow  the  shortest  expression  of  any  scenario,  especially 
when  it  contains  repetitive  patterns  (e.g.  denial  of  service) 

3.  it  should  be  as  general  as  possible  in  order  to  allow  the  design  of  any 
scenario. 

It  is  useful  to  refer  to  the  works  done  in  the  field  of  pattern  matching: 
the  set  of  audit  events  can  be  seen  as  an  alphabet,  each  audit  event  as  a 
character, the audit trail as a main string and the scenarios as sub-strings
to locate in this main string. To meet the previous requirements, we
chose  regular  expressions  with  back  referencing  (rewbr)  [7]  as  a  language 
to  design  the  attack  scenario.  The  referencing  operator  allows  the  design 
of  any  scenario. 
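
The role of the back-referencing operator can be illustrated with Python's `re` module, which supports back references such as `\1`. This is a minimal sketch under an invented encoding, not the authors' tool: each audit record is written as a user letter followed by an event letter and a semicolon, with the hypothetical codes X (execute access to /bin/cp), R (read access to /etc/passwd) and L (login).

```python
import re

# Scenario "password copy": some user executes /bin/cp and, later on, the
# SAME user reads /etc/passwd. The back reference \1 ties the two events
# to one user; (?:\w\w;)*? skips any records in between.
PASSWORD_COPY = re.compile(r"(\w)X;(?:\w\w;)*?\1R;")


def find_password_copy(trail: str):
    """Return the letter of a user matching the scenario, or None."""
    match = PASSWORD_COPY.search(trail)
    return match.group(1) if match else None
```

For instance, the trail `"aX;bL;aR;"` matches for user `a`, while `"aX;bR;"` does not, because the read is performed by a different user. Without the back reference, one pattern per user would have to be enumerated.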

In  this  way,  the  "security  audit  trail  analysis  problem"  (SATAP) 
becomes  similar  to  a  "finding  a  rewbr  in  a  string"  problem  (FRSP).  The 
NP-Completeness  of  FRSP  [7]  makes  classical  algorithms  quite 
impossible  to  apply  to  real  audit  logs  (recall  the  huge  amount  of  data 
recorded). 


4.  A  More  Precise  Expression  of  the  SATA  Problem 

If  the  SATA  was  made  attack  by  attack  (i.e.  rewbr  by  rewbr),  exclusive 
attacks  could  be  declared  present  at  the  same  time.  To  avoid  this 
problem,  we  have  to  consider  the  whole  set  of  attack  scenarios  in  a  single 
analysis.  We  have  to  determine,  among  all  the  possible  attacks’  sub-sets, 
the  one  which  presents  the  greatest  risk  to  the  system.  For  this,  we 
suppose  that  the  SO  is  able  to  evaluate  the  risk  inherent  in  each  scenario. 
Consequently  we  attribute  to  each  scenario  a  weight  proportional  to  that 
risk.  By  default,  these  weights  are  all  equal  to  1  and  we  simply  look  for 
the  biggest  possible  sub-set. 

More  formally,  our  approach  of  SATAP  can  be  expressed  by  the 
following  statement:  A  is  an  alphabet  whose  letters  are  auditable  events, 
S  is  a  set  of  attack  scenarios  expressed  by  rewbr  made  with  A’s  letters 
(each  scenario  Si  is  associated  to  a  weight  Wi),  T  is  the  audit  trail  which  is 
to be analyzed and viewed as a string of A's letters. We have to find the
sub-set S' of S so that the total weight is maximized and so that each
rewbr of S' matches a different sub-string of T.

Theorem  1.  SATAP  is  NP-complete. 

Proof.  If  there  are  n  different  rewbr  in  S,  there  are  M  possible  non-empty 
sub-sets  of  S.  If  we  ignore  the  order  of  the  successive  attacks,  we  have: 

$$M = \sum_{i=1}^{n} \binom{n}{i} = 2^n - 1 \qquad (1)$$

The  number  of  possible  sub-sets  increases  exponentially  with  the 
number  of  potential  attacks.  Each  sub-set  relates  to  a  hypothesis 
corresponding to the attacks which would be actually present in the audit
trail. However, not all the hypotheses are realistic. So for each of the $2^n-1$
possible sub-sets, we must make a decision about the realism of the
corresponding hypothesis. For this, we have to solve from 1 to n FRSP
problems (one for each element of the sub-set until there is no match in T
for a particular element of the sub-set). To solve SATAP, it is necessary
to solve at least $2^n-1$ FRSPs. We have here linked NP-complete
problems. 
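
The count in equation (1) is easy to check by enumeration for small n. The following sketch only illustrates the growth of the hypothesis space; the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def nonempty_subsets(scenarios):
    """Yield every non-empty sub-set of the given scenarios, by size."""
    for size in range(1, len(scenarios) + 1):
        yield from combinations(scenarios, size)


# Three scenarios give 2**3 - 1 = 7 hypotheses to examine.
hypotheses = list(nonempty_subsets(["s1", "s2", "s3"]))
```

With the 18 attacks used later in the paper, the same enumeration would already produce 2^18 - 1 = 262143 hypotheses, each requiring up to 18 matching problems.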


5.  A  Search  Algorithm  for  the  SATA  Problem 

The NP-completeness of SATAP makes classical algorithms quite
impossible to apply to real audit logs (recall the huge amount of data
recorded). So we propose to use a heuristic method, the so-called
"genetic algorithm".

Genetic  algorithms  (GA)  [8],  proposed  by  Holland  (1975)  [9],  are 
optimum search algorithms based on the mechanism of natural selection
in  a  population. 

A  population  is  a  set  of  artificial  creatures  (individuals  or 
chromosomes). These creatures are strings of length l coding a potential
solution  to  the  problem  to  be  solved,  most  often  with  a  binary  alphabet. 
The  size  L  of  the  population  is  constant.  The  population  is  nothing  but  a 
set  of  points  in  a  search  space. 

The  population  is  randomly  generated  and  then  evolves:  in  every 
generation,  a  new  set  of  artificial  creatures  is  created  using  the  fittest  or 
pieces  of  the  fittest  individuals  of  the  previous  one.  The  fitness  of  each 
individual  is  simply  the  value  of  the  function  to  be  optimized  (the  fitness 
function)  for  the  point  corresponding  to  the  individual.  The  iterative 
process  of  population  creation  is  achieved  by  three  basic  genetic 
operators [9]: selection (selects the fittest individuals), reproduction
(promotes  exploration  of  new  regions  of  the  search  space  by  crossing 
over  parts  of  individuals)  and  mutation  (protects  the  population  against 
an  irrecoverable  loss  of  information).  The  general  structure  of  a  GA  is 
thus  the  following: 

    Random generation of the first generation
    Repeat
        Individual selection
        Reproduction
        Mutation
    Until an individual outclasses others

Genetic  operators  are  randomized  ones  but  genetic  algorithms  are  no 
simple  random  walk:  they  efficiently  exploit  historical  information  to 
speculate  on  new  search  points  with  expected  improved  performance. 
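
The general structure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: selection is fitness-proportional (roulette wheel), reproduction is one-point crossover, the loop stops after a fixed number of generations rather than "until an individual outclasses others", and the mutation probability `p_mut` is an arbitrary illustrative value. The fitness function is assumed to return non-negative values, and `pop_size` is assumed to be even.

```python
import random


def genetic_search(fitness, length, pop_size=20, generations=50,
                   p_cross=0.6, p_mut=0.02, seed=0):
    """Return the best bit string seen; fitness must be non-negative."""
    rng = random.Random(seed)
    # Random generation of the first generation.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Selection: draw parents with probability proportional to fitness
        # (the small constant keeps the weights valid if all fitnesses are 0).
        weights = [fitness(ind) + 1e-9 for ind in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        # Reproduction: one-point crossover on consecutive pairs of parents.
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if length > 1 and rng.random() < p_cross:
                cut = rng.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            offspring.extend([a, b])
        # Mutation: flip each bit with a small probability.
        for ind in offspring:
            for i in range(length):
                if rng.random() < p_mut:
                    ind[i] ^= 1
        pop = offspring
        best = max(pop + [best], key=fitness)
    return best
```

As a usage example, `genetic_search(sum, length=8)` searches for the 8-bit string with the most ones. Keeping track of `best` across generations is a convenience of this sketch; it is not one of the three basic operators.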

Because  of  space  constraint,  we  cannot  expand  this  section.  For 
more developments, please refer to the literature [8] [9] [10] [11] [12].


6.  Security  Audit  Trail  Analysis  Using  Genetic  Algorithms 

Two  sub-problems  arise  when  applying  GAs  to  a  particular  problem:  (i) 
coding  a  solution  for  that  problem  with  a  string  of  bits  and  (ii)  finding  a 
fitness function to evaluate each individual of the population. To satisfy
these  two  requirements,  we  had  to  simplify  our  vision  of  SATAP. 

6.1  A Simplified Vision of SATAP

As  seen  previously,  we  have  linked  NP-complete  problems:  for  each 
possible sub-set among $2^n-1$ and for each element of the sub-set find a
different matched string in the audit trail. Our goal is to determine,
among all the possible attacks' sub-sets, the one which presents the
greatest risk to the system: we adopt a simplified scheme which consists
of bypassing the matching problem.

To  make  a  decision  about  the  realism  of  the  hypothesis 
corresponding  to  a  particular  sub-set,  we  propose  to  work  on  the  events 
rather  than  on  the  attacks.  This  means  that  we  count,  for  a  particular 
attacks' sub-set, the number of events of each type generated by all the
attacks. If, for every type, this number is less than or equal to the
number of recorded events of that type, then the hypothesis
corresponding to the sub-set is realistic. It is a way to achieve a
hypothetico-deductive scheme [13], just like a human expert would
probably do for SATA: a hypothesis is made (e.g. among the set of 18
possible attacks (figure 1), the attacks 3, 7 and 12 are present) and the
deduction involves an evaluation of the hypothesis (in our approach,
deduction  is  enforced  through  the  fitness  function,  as  presented  in  the 
remainder  of  this  paper).  According  to  this  evaluation,  an  improved 
hypothesis  is  tried,  until  a  solution  is  found. 

This  simplified  vision  of  SATAP  (SSATAP)  implies  the  translation 
(with  a  linear  one-pass  algorithm)  of  the  audit  trail  into  an  observed  audit 
vector  O  (a  real-time  construction  of  O  could  be  considered).  Basically, 
Oi  counts  the  number  of  i  type  events  present  in  the  audit  trail.  This 
translation  results  in  the  loss  of  the  time  sequence.  This  presents  some 
difficulties  but,  in  addition,  avoids  the  problem  of  events  reordering 
which is not easy when timing information gained from the audit
subsystem is not precise enough. In the case of network audit trail, where a
global  time  does  not  exist,  this  could  be  an  advantage.  In  some  cases,  it 
could  be  useful  to  apply  other  building  rules  for  O.  Oi  then  counts  the 
number  of  sequences  of  particular  events. 
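
The basic construction of O is a single counting pass over the trail. A minimal sketch, assuming the trail has already been reduced to a list of event-type names (the names used in the example are illustrative):

```python
from collections import Counter


def observed_vector(trail, event_types):
    """Count, in one pass, how often each event type occurs in the trail."""
    counts = Counter(trail)
    return [counts[event] for event in event_types]


EVENT_TYPES = ["passwd read", "cp command", "su command"]
TRAIL = ["cp command", "passwd read", "cp command"]
O = observed_vector(TRAIL, EVENT_TYPES)  # [1, 2, 0]
```

The order of the records in `TRAIL` plays no part in the result, which is the loss of time sequence mentioned above.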

6.2  Proof  of  the  NP-completeness  of  SSATAP 

Formally,  SSATAP  can  be  expressed  by  the  following  statement: 

-  let $N_e$ be the number of audit events and $N_a$ the number of
potential attacks

-  let AE be an $N_e \times N_a$ attacks-events matrix which gives the set of
events generated by each attack. $AE_{ij}$ is the number of audit events
of type i generated by the scenario j ($AE_{ij} \geq 0$). (See section 7.2 for
an example of such a matrix)

-  let W be an $N_a$-dimensional weight vector, where $W_i$ ($W_i > 0$) is the
weight associated with the attack i ($W_i$ is proportional to the risk
inherent in the attack scenario i)

-  let O be the $N_e$-dimensional observed audit vector defined in the
previous section

-  let H be an $N_a$-dimensional hypothesis vector, where $H_i$ equals 1 if
the attack i is present according to the hypothesis and $H_i$ equals 0
otherwise (H describes a particular attacks' sub-set).

SSATAP consists in finding the H vector which maximizes the W.H
product, subject to the constraint $(AE.H)_i \leq O_i$ ($1 \leq i \leq N_e$).
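
For very small instances the statement above can be solved by exhaustive search, which makes the roles of AE, W, O and H concrete. The sketch below is only an illustration (the matrix and vectors are invented, and the function names are hypothetical); its cost grows as $2^{N_a}$, which is exactly why a heuristic is sought.

```python
from itertools import product


def is_realistic(AE, H, O):
    """Check the constraint (AE.H)_i <= O_i for every event type i."""
    return all(
        sum(AE[i][j] * H[j] for j in range(len(H))) <= O[i]
        for i in range(len(O))
    )


def solve_ssatap_bruteforce(AE, W, O):
    """Return (H, W.H) maximizing W.H among the realistic hypotheses."""
    best_h, best_weight = None, -1
    for H in product((0, 1), repeat=len(W)):
        if is_realistic(AE, H, O):
            weight = sum(w * h for w, h in zip(W, H))
            if weight > best_weight:
                best_h, best_weight = H, weight
    return best_h, best_weight


# Invented instance: 2 event types (rows), 3 attacks (columns).
AE = [[3, 0, 2],
      [0, 2, 2]]
W = [1, 1, 1]
O = [3, 2]
H, weight = solve_ssatap_bruteforce(AE, W, O)  # ((1, 1, 0), 2)
```

Here attacks 1 and 2 together explain the observed counts, whereas adding attack 3 would require more events of both types than were recorded.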

Theorem  2.  SSATAP  is  NP-complete. 

Proof.  SSATAP  can  be  polynomially  reduced  to  the  zero-one  integer 
programming  problem  (ZOIP).  ZOIP  can  be  expressed  by  the  following 
statement  [14]: 

Instance: A finite set S of pairs (X,b), where X is an m-tuple of
integers and b is an integer, an m-tuple C of integers and an integer
B.

Question: Is there an m-tuple Y of integers such that $X.Y \leq b$ for all
pairs (X,b) and such that $C.Y \geq B$?

SSATAP can be directly reduced to ZOIP by writing:
number of pairs (X, b) in the S set = $N_e$
m = $N_a$
X = $(ae_{i,1}, ae_{i,2}, \ldots, ae_{i,N_a})$ (a line of the AE matrix)
b = $O_i$
C = W
B = the extremum of the fitness function
Y = H

ZOIP  is  NP-complete  [14].  Therefore  SSATAP  is  NP-complete. 


6.3  GASSATA,  a  Genetic  Algorithm  for  Simplified  Security  Audit  Trail 
Analysis 


Reference [15] gives an example of a genetic algorithm solving an
NP-complete problem similar to ZOIP, the set covering problem (SCP).
Liepins and Potter studied in 1991 a genetic approach for multiple-fault
diagnosis [16] based on SCP. We used these two papers as a starting
point  for  GASSATA. 

6.3.1  Coding a solution to SSATA with a binary string

Recall that an individual is a string of length l coding a potential solution
to the problem to be solved. In our case, the coding is straightforward:
the length of an individual is $N_a$ and each individual in the population
corresponds to a particular H vector as defined in the previous section.


6.3.2  The Fitness Function

We  have  to  search,  among  all  the  possible  attacks’  sub-sets,  for  the  one 
which  presents  the  greatest  risk  to  the  system.  This  results  in  the 
maximization  of  the  product  W.H.  As  GAs  are  optimum  search 
algorithms  for  finding  the  maximum  of  a  so-called  fitness  function,  we 
can  easily  conclude  that  in  our  case  this  function  should  be  made  equal  to 
the  product  W.H.  So  we  have: 


$$\mathrm{Fitness} = \sum_{i=1}^{N_a} W_i I_i \qquad (2)$$


where  I  is  an  individual. 

This  fitness  function  does  not,  however,  pay  attention  to  the 
constraint  feature  of  SSATA  which  implies  that  some  hypotheses  (i.e. 
some individuals) among the $2^{N_a}-1$ possible ones are not realistic. This is
the case when, for some event type i, $(AE.H)_i > O_i$. There are several
ways  to  take  a  constraint  into  account  with  GAs:  (i)  modifying  the 
genetic  operators  so  that  they  only  generate  "good"  individuals  (i.e.  with 
respect  to  the  constraint),  (ii)  repeating  each  crossover  or  mutation 
process  until  a  "good"  individual  is  generated,  (iii)  penalizing  the  "bad" 
individuals  by  reducing  their  fitness  value.  When  a  large  number  of 
individuals  do  not  respect  the  constraint  (this  is  precisely  our  case)  the 
third  solution  is  the  best  one. 

To  reduce  the  fitness  value  for  a  "bad"  individual,  we  compute  a 
penalty  function  (P)  which  increases  as  the  realism  of  this  individual 
decreases: let $T_e$ be the number of types of events for which $(AE.H)_i > O_i$;
the penalty function applied to such an H individual is then:

$$P = T_e^p \qquad (3)$$


A quadratic penalty function (i.e. p=2) allows a good discrimination
among the individuals. The proposed fitness function is thus the
following:

$$F(I) = \alpha + \left[ \sum_{i=1}^{N_a} W_i I_i - \beta\, T_e^2 \right] \qquad (4)$$

The β parameter makes it possible to modify the slope of the penalty
function and α sets a threshold making the fitness positive. If a negative
fitness value is found, it is set to 0 and the corresponding individual
will die. So the α parameter allows the elimination of unrealistic
hypotheses. 
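
The penalized fitness can be sketched directly from the definitions of W, AE, O and the penalty on the number of violated event types. This is a minimal illustration, not the authors' code: the function name and the example matrix are invented, while the default values α=50, β=1 and p=2 are the ones reported for the experiments in section 7.3.

```python
def penalized_fitness(individual, AE, W, O, alpha=50.0, beta=1.0, p=2):
    """alpha + (W.I - beta * Te**p), clamped at 0 for unrealistic individuals."""
    n_a = len(individual)
    # Total weight of the attacks the individual declares present.
    weight = sum(w * gene for w, gene in zip(W, individual))
    # Te: number of event types for which the hypothesis needs more events
    # than were actually observed.
    te = sum(
        1 for i in range(len(O))
        if sum(AE[i][j] * individual[j] for j in range(n_a)) > O[i]
    )
    return max(alpha + weight - beta * te ** p, 0.0)


# Invented instance: 2 event types (rows), 3 attacks (columns).
AE = [[3, 0, 2],
      [0, 2, 2]]
W = [1, 1, 1]
O = [3, 2]
```

With these values, the realistic individual (1, 1, 0) scores 50 + 2 = 52, while (1, 1, 1) violates both event types and scores 50 + 3 - 4 = 49, so the heavier but unrealistic hypothesis is ranked lower.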

6.3.3  The genetic operators

We  are  using,  at  the  moment,  the  three  basic  operators  defined  by 
Holland  [9]. 


7.  Experimental  Results 

We give in this section a brief survey of the IBM¹ AIX² security audit
system.  Then,  we  describe  the  attacks-events  matrix.  Lastly,  we  present 
our  experiments  and  their  results. 

7.1  IBM AIX security audit system

The AIX security audit system [17] [18] [19] is designed to satisfy the
C2 security requirements of the Orange Book [20]. Some audit events are
generated by the AIX kernel subroutines. Other events can be defined by
the SO (some are proposed by default). These latter events can be
associated  with  objects  (files)  for  a  write,  a  read  or  an  execute  access. 
Both  types  of  events  can  be  associated  with  subjects  (users).  For  our 
experiments,  we  use  25  kernel  or  self-defined  events  [21]. 

Events  are  recorded  in  a  protected  trail.  The  audit  trail  records 
format  is  the  following:  event  name,  status  (OK  or  FAIL),  real  user  id, 
login  user  id,  program  name  (the  command  which  generated  the  event), 
process  id,  parent  process  id,  time  in  seconds,  record  tail  (contains 
information  depending  on  the  event  such  as  the  file  name  in  case  of  a  file 
opening  event). 

7.2  The Attacks-Events Matrix

In  practice,  we  encounter  problems  designing  consistent  attacks  and  thus 
we  consider  rather  short  ones.  We  add  in  the  attacks  set  some  "suspicious 
actions" such as repeated use of the "who" command.

An attack is an unordered set of audit events which happened
during an audit session. We work with 30-minute audit sessions. This
represents about 85 kilobytes for each audited user (users are software
developers for the moment). We translate the audit trail into user-by-user
audit vectors with a linear one-pass algorithm. Successive audit trails and
audit vectors should be archived on tapes for possible future
investigations. 
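
As an illustration, the one-pass translation of an audit trail into
user-by-user audit vectors could be sketched as follows. This is only a
minimal sketch: the three event names, the reduced record layout (event
name and login user id only) and the function name are assumptions made
for the example, not the actual GASSATA implementation.

```python
from collections import defaultdict

# Hypothetical event names; the experiments use 25 kernel or self-defined events.
EVENT_TYPES = ["FILE_Open", "PROC_Create", "USER_SU"]
EVENT_INDEX = {name: i for i, name in enumerate(EVENT_TYPES)}


def audit_vectors(records):
    """Count, per login user id, how often each tracked event type occurs.

    `records` is an iterable of (event_name, login_uid) pairs, a simplified
    stand-in for the full audit record format.
    """
    vectors = defaultdict(lambda: [0] * len(EVENT_TYPES))
    for event_name, login_uid in records:   # single pass over the trail
        idx = EVENT_INDEX.get(event_name)
        if idx is not None:                 # ignore events that are not tracked
            vectors[login_uid][idx] += 1
    return dict(vectors)


trail = [
    ("FILE_Open", 201),
    ("USER_SU", 201),
    ("FILE_Open", 305),
    ("FILE_Open", 201),
]
vectors = audit_vectors(trail)
```

Each record is read exactly once, so the cost grows linearly with the
length of the trail, and the order of the events inside a session is
not kept.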

Figure 1 shows the attacks-events matrix that we use for our
experiments ("." taking the place of "0" for legibility reasons). In that
matrix each column corresponds to an attack. For example, column 8
corresponds to a browsing attack which can be characterized by a high
rate of use for "ls" and "more" or "pg" commands (in our case 10 of each
during the 30 minutes of the audit session) [21].

1,2. IBM and AIX are trademarks of International Business Machines Corporation.


[Figure 1: the matrix entries are not reproduced. The rows correspond to
the following audit events.]

passwd read
group read
hosts read
opasswd read
ogroup read
fail sensitive files write
fail day login
night login
su command
who, w, df, ...
fail ls command
ls command
cp command
rm command
whoami command
more or pg cmd
passwd command
fail chmod cmd
fail chown cmd
fail file open
file deletion
process creation
process exec.
process priority modification

Figure 1. An Attacks-Events Matrix for GASSATA


7.3  Experiments  and  Results 

All our experiments are made using the following parameters for the
fitness function: α=50, β=1 and p=2. Each experiment can be
characterized by a 5-tuple (Pc, Pm, L, g, a) where Pc is the crossover
probability, Pm is the mutation probability, L the population size, g the
number of generations and a the number of attacks actually present in
the audit trail. The default values for these parameters are Pc=0.6,
Pm=0.0083, L=100, g=100 (they correspond to classical values when
using GAs) and a=2.

Because of their non-independence, we vary each parameter
separately by taking values from the following sets:

Pc: (0.5, 0.7, 0.8, 0.9, 1.0)

Pm: (0, 0.00166, 0.00332, 0.00498, 0.00664, 0.00996,
0.01162, 0.01328, 0.01494, 0.0166)

L: (20, 50, 150)

g: (1000)

a: (0, 1, 5, 10, 15, 18)

The 5-tuple (Pc, Pm, L, g, a) thus takes 25 different values. For each
of these values, we perform 10 runs (all the following results are averages
over the 10 runs):

- for each generation, we compute the minimum, maximum and
average whole population fitness values

- when the GA stops, we count, for each bit position i along the
strings of the final population, the number of 1 values and compute
a rate ti by dividing this number by L. When ti is greater than or
equal to a given detection threshold θ, the i-th attack is declared
present.

We define four rates T1, T2, T'1 and T'2 as follows:

- T1 is the number of detected present attacks out of the number of
present attacks

- T2 is the number of detected not present attacks out of the number
of not present attacks

- T'1 is the number of individuals in which bits corresponding to
present attacks are 1 out of the total number L of individuals

- T'2 is the number of individuals in which bits corresponding to not
present attacks are 1 out of the total number L of individuals

T1 and T2 are respectively the detection rate and the false alarm rate.
They depend on θ. Our tool should present a high rate of detection and a
low rate of false alarm. T'1 and T'2 qualify the results of the GA. T'1
should be equal to 1 and T'2 to 0.
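
The computation of the rates ti and of T1 and T2 from a final population
can be illustrated by the following minimal sketch. The function names,
the toy population and the convention used when no attack is present
(T1 = 1) or when every attack is present (T2 = 0) are assumptions made
for the example.

```python
def bit_rates(population):
    """Return t_i, the fraction of individuals whose i-th bit is 1."""
    size = len(population)
    length = len(population[0])
    return [sum(ind[i] for ind in population) / size for i in range(length)]


def detection_rates(population, present, theta):
    """Return (T1, T2) for a final population.

    `present` is the set of indices of the attacks really present in the
    audit trail, `theta` is the detection threshold.
    """
    rates = bit_rates(population)
    declared = {i for i, t in enumerate(rates) if t >= theta}
    absent = set(range(len(rates))) - present
    # Assumed conventions for the degenerate cases (empty sets).
    t1 = len(declared & present) / len(present) if present else 1.0
    t2 = len(declared & absent) / len(absent) if absent else 0.0
    return t1, t2


# Toy final population: 4 individuals, 4 candidate attacks.
population = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
t1, t2 = detection_rates(population, present={0, 2}, theta=0.5)
```

In this toy case the rates are 1, 0.25, 0.75 and 0, so with θ=0.5 the
attacks 0 and 2 are declared present and neither of the two others is.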

7.3.1 Minimum, Maximum and Average Fitness Values

The maximum fitness value converges quickly on the optimum (after
about 20 generations with the default parameters, figure 2). The
remainder of the population follows and after 100 generations, the
average fitness is about 9/10 of the maximum fitness. If Pc, Pm or L grows,
GASSATA converges more slowly but the risk of reaching a local optimum
(due to a so-called "premature convergence" [8]) decreases. We observe
that if a>15, the number of generations has to be increased in order to
ensure GA convergence.


7.3.2 T1 and T2

Figure 3 shows the evolution of T1 and T2 versus the detection threshold
θ for the default parameters. It shows that 0.5 is a good value for θ as we
then have T1=1 and T2=0. We observe that it is always the case with the
default parameters. If Pc or Pm increases, the optimal θ value decreases.
If L or g increases, the optimal θ value increases.


8.  Remaining  Problems 

We must mention some remaining problems, which arise due to the use
of predefined attack scenarios or to the use of our simplified vision of
SATAP.


8.1  Problems  due  to  the  Predefined  Attack  Scenarios  Approach 

In practice, the design of consistent attack scenarios is difficult. For
security  reasons,  information  on  the  subject  is  often  kept  secret. 
Nevertheless,  note  that  our  approach  allows  an  easy  addition  of  any 
scenario,  according  to  the  security  officer’s  knowledge. 

The  translation  of  intrusive  behavior  into  audit  events’  sequences  or 
sets  requires  a  very  good  knowledge  of  the  system  kernel.  Sometimes,  this 
knowledge  is  not  easy  to  gain. 

8.2 Problems due to the Simplified Vision of SATAP

The translation of the audit trail into an observed audit vector implies the
loss of the time sequence. It is a major problem that we must solve. To
that end, we plan to formalize SATAP as a graph contractability problem. No
work has been done in that area for the moment.

By using binary coding for the individuals, we cannot detect the
multiple realization of a particular attack. As a consequence, we should
try non-binary GAs. In that case, the GA execution time should grow
because of the search space extension.

GASSATA finds the H vector which maximizes the W.H product,
subject to (AE.H)i ≤ Oi (1 ≤ i ≤ Ne). If the audit session is too long, this
constraint is always satisfied and GASSATA converges on the
Na-dimensional unit vector. To avoid this problem, the duration of the
audit session should be chosen carefully. In our case, we work with
30-minute audit sessions.
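
The effect of a too long audit session can be illustrated on a toy
instance. The following sketch replaces the GA by an exhaustive search,
which is only practical for a handful of attacks; the matrix AE, the
weights W and the observed vectors O are invented for the example.

```python
from itertools import product


def best_hypothesis(AE, W, O):
    """Exhaustively maximise W.H subject to (AE.H)_i <= O_i for every event i.

    AE[i][j] is the number of events of type i generated by attack j,
    W[j] the weight of attack j and O[i] the observed count of event i.
    """
    n_events, n_attacks = len(AE), len(W)
    best, best_value = None, -1
    for H in product((0, 1), repeat=n_attacks):
        feasible = all(
            sum(AE[i][j] * H[j] for j in range(n_attacks)) <= O[i]
            for i in range(n_events)
        )
        value = sum(w * h for w, h in zip(W, H))
        if feasible and value > best_value:
            best, best_value = H, value
    return best, best_value


AE = [
    [3, 0, 1],
    [0, 2, 1],
    [1, 1, 0],
]
W = [1, 1, 1]
O_short = [3, 2, 2]     # observed vector of a short session
O_long = [30, 20, 10]   # observed vector of a much longer session

short_result = best_hypothesis(AE, W, O_short)
long_result = best_hypothesis(AE, W, O_long)
```

With the short session only attacks 0 and 1 fit the observed counts.
With the long session every constraint holds for any H, so the all-ones
vector is returned whatever really happened.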

If  the  same  event  or  group  of  events  occurs  in  several  attacks,  an 
intruder  realizing  these  attacks  simultaneously  does  not  duplicate  this 
event  or  group  of  events.  In  that  case,  GASSATA  fails  to  find  the  optimal 
H  vector.  We  have  no  solution  to  that  problem  for  the  moment.  This 
means  that  we  only  consider  independent  attacks. 

To end this section, it should also be stated that GASSATA does not
locate attacks in the audit trail. Just like statistical tools, it only gives a
presumptive set of attacks and the audit trail must be investigated afterwards
by the security officer to precisely locate the attacks.


9.  Conclusion 

The experiments (section 7) show good results for GASSATA, which
validates the genetic approach for security audit trail analysis:

- if θ=0.5, then T1=1 and T2=0 in every case; the high rate of
detection and the low rate of false alarm requirements are satisfied

- the minimal value for T'1 is about 0.6 and the maximum value for
T'2 is about 0.15 with the default configuration: it shows a quite
good convergence of the GA, which explains the previous results.

If we let the GA run for a given number of generations, the execution
time is constant for any audit vector (note that this is not the case for the
audit-trail-into-audit-vector translation time). We observe that the GA
always converges if we let it run for 100 generations and if a<10: in this
case the execution time is 9 seconds with an IBM RS6000 (11.7 Mflops). For
higher a values, we must increase the number of generations.

Nevertheless, some problems remain (section 8) which motivate
future studies, especially for including timing aspects in GASSATA.

Let us finish by noting that we see GASSATA as nothing more than
an additional tool in the set of the security officer's intrusion detection
tools.
Abstract

This paper focuses on the key generation problem for a modified RSA
public key cryptographic system based on the RNS arithmetic. The
RNS based modification of the well known RSA algorithm uses
highly parallel computation with the restriction that only a subset of
key triples (D, ekey, dkey) of a conventional RSA system can be
adopted. These restrictions result from the choice of base elements
used. The present work shows that the remaining set of possible keys
is still large enough to be used in a realistic cryptographic system. The
encryption machine under discussion can exploit parallelism and thus
achieve high speed. A rather straightforward algorithm for the generation of
keys can be given. The resulting key space can be viewed as
satisfactory. The method gives an additional degree of freedom for the
implementation on parallel systems, avoiding all conversions between
number systems.


1  Introduction 

Parallel computation of the RSA algorithm is well known to be complicated [1].
This results from the fact that the central operation involved (modulo arithmetic)
demands intensive computation with extremely high communication. From the
theoretical point of view some algorithms would outline a solution [2][3]. All these
methods are too intensive in communication and not feasible as a parallel chip
design since the chip area is limited. Available results therefore usually get stuck
around the 50 Kbit/sec. performance range with a 512 bit key length, not considering
the Chinese remainder theorem [4]. Using top technologies, like silicon on insulator
in sub micron, designs can improve these results only by a factor of two. However,
as can be seen in presented designs, parallelism is not extensively exploited [5].

The dilemma is twofold. First, highly parallel designs on a single chip are
impossible due to limited chip sizes. Second, cutting into many chips cannot be
managed due to extensive communication. In previous projects it has been shown
that both deficiencies can be fought somehow. Scalable parallelism is one special
answer that enables the designer to use all available chip area and to gain
encryption rates in the area of 200 to 300 Kbit/sec. with 1µ CMOS technology [6].
RNS arithmetic could be an answer to the distribution of the processing power over
many chips [7]. The second method also relieves the heat dissipation problem which
would also arise on highly parallel single chip designs for RSA. The technique, which
is based mainly on RNS base extension, can be used to construct a public key
cryptographic system. This method is equivalent to RSA in terms of security.
However, it restricts the key space that can be used.

Restriction on the key space is due to the RNS arithmetic and the restrictions that
have to be made on the RNS base elements to ensure efficient computation of the
algorithm. These restrictions are discussed in the presented paper and an algorithm
to construct RNS base elements and key pairs is presented. The paper also contains
a sketch of the algorithm itself and of the underlying register oriented arithmetic.


2  The  encryption  machine 

For the RSA-like algorithm, cipher = plain^ekey MOD D has to be computed [9].
This can be done using Knuth's square and multiply algorithm [10]. This fact
breaks the modulo exponentiation down to modulo multiplication MMUL.
Basically this would involve multiply and divide operations. For reasons of
performance and uniformity of operations, which is of prime importance in a VLSI
design, it is most desirable to avoid the divide operation. This can be done by the
FASTMM fast modulo multiplication as described in [3]. As described in more
detail in [7], RNS can load a set of processing elements more equally and thus call
for less communication on the single processing element when combined with
Montgomery's reduction [2]. The basic idea is to substitute the radix by a product
N = ∏ p_i of base elements. This reduces the modulo multiplication to two
base extensions as shown in figure 1.
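
The reduction of the exponentiation to a sequence of modulo
multiplications can be illustrated by a plain square and multiply loop.
This is a minimal software sketch on ordinary integers, using the
division based `%` operator that the hardware design seeks to avoid; the
function name and the small example numbers are assumptions made for
illustration.

```python
def mod_exp(plain, ekey, D):
    """Left-to-right square and multiply: returns plain**ekey mod D."""
    result = 1
    base = plain % D
    for bit in bin(ekey)[2:]:               # exponent bits, most significant first
        result = (result * result) % D      # one squaring per exponent bit
        if bit == "1":
            result = (result * base) % D    # one extra multiplication per 1 bit
    return result


cipher = mod_exp(65, 17, 3233)
```

The number of modulo multiplications depends only on the exponent: one
squaring per bit plus one multiplication per 1 bit. It is therefore
constant for a given key.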

The method, which is described in full detail in [14], leaves an unprocessed factor
with each modulo multiplication step. As the number of multiplications is constant
for a given key, this factor in total is also constant and compensation can be achieved
by a single multiplication. This multiplication is done with the scheme shown in
figure 1. Again a factor would be introduced. Taking this last fact into account, a
correcting factor can be precomputed which, if multiplied after the encryption
process, compensates for all errors.

Base extension itself is nontrivial and good methods in log time are known only if
control base elements are available. This is definitely not the case for the Z MOD N
step of figure 1. In [14] an approximation is presented. This approximation is exact
if the initial value and thus the plain text offsets at least by some Δ from zero. This
in turn also means that restricting relations between D and the M and N values
should hold:


(a)  D{l+eo)<N<D+^ 

(b)  4D<M<4MD. 

(c)  M=4Ni  4N<M 

Restriction (a) results from the fact that modulo reduction is achieved with results
out of [0, D(1+ε_D)]. Here ε_D can be kept small but depends on the number of
base elements used in the arithmetic. Typically ε_D < 1/1000 is easy to get. Thus, the
assumption D < N can be used for further discussion. Restrictions (b) and (c) result
from the interval decision process during the reduction described in [14].

Figure 1: Base extension as substitute for modulo multiplication.

The roughly outlined method can be processed on the register level of the RNS base
in all steps.

Building an encryption machine needs connecting these register oriented elements.
Optimum parallelism would be reached if a divide and conquer method were
involved when computing the convolution sums. Convolution sums are basically
needed in the base extension process used with this work [7]. This however would
require too many elements. Therefore a bus connected processor set is
assumed to be more practical, still giving good results. The result is a technically
feasible parallelism. Convolution sums in this context have to be computed with
modulo arithmetic within the registers.

Figure 2: A possible structure for a parallel ciphering machine.


A practical feasible configuration is outlined in figure 2. Serial connection and
communication is assumed for distribution of constants and plain text feeding as
well as retrieval of the cipher text. The main feature of such a configuration is the
possible distribution of processing over a set of chips. The key to this nontrivial
issue is low communication as compared with other methods.


3  The  modified  algorithm  with  efficient  register 
arithmetic 

This paragraph focuses on the a·b+c MOD p_i that has to be computed in each
processing element along with the convolution. It is included to point at the
necessity to restrict the set of possible keys. The basic idea is to use only those p_i
elements that can easily be managed. Besides the p_i = 2^r, where modulo
multiplication is just a cut on the binary digits, a p_i that can be written as
p_i = 2^r − μ_i, with μ_i having a low 1-bit count, is convenient for processing as well.
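
The reason why such moduli are convenient can be sketched in software:
since 2^r ≡ μ (mod 2^r − μ), the bits above position r can be folded
back by a multiplication with μ, which needs only a few shifts and adds
when μ has few 1 bits. The following minimal sketch works on ordinary
integers and does not model the Wallace tree or the redundant number
representation; the function names and the example value μ = 11 are
assumptions made for illustration, and μ is assumed to satisfy
0 ≤ μ ≤ 2^(r−1).

```python
def reduce_special(x, r, mu):
    """Reduce x >= 0 modulo p = 2**r - mu without a division.

    Uses 2**r = mu (mod p), so x = hi * 2**r + lo = hi * mu + lo (mod p).
    Assumes 0 <= mu <= 2**(r - 1).
    """
    p = (1 << r) - mu
    mask = (1 << r) - 1
    while x >> r:                         # more than r bits left
        x = (x >> r) * mu + (x & mask)    # fold the high part back in
    if x >= p:                            # final correction into [0, p)
        x -= p
    return x


def mul_add_mod(a, b, c, r, mu):
    """Compute a*b + c modulo 2**r - mu, the operation of one register."""
    return reduce_special(a * b + c, r, mu)


R, MU = 42, 11                # register length 42 as in the text; mu is hypothetical
P = (1 << R) - MU
result = mul_add_mod(P - 1, P - 1, P - 1, R, MU)
```

For μ = 0 the loop degenerates to a simple cut on the binary digits,
which is the case p_i = 2^r mentioned above.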

Starting with the Wallace scheme, the result is a redundantly represented number and
can be processed as shown in figure 3, using this assumption in a first phase
cutting the 2r bits down to r + log2(μ_i) + 2.

Figure 3: The first phase of a_i·b_i + c_i MOD p_i.


A second phase is involved giving the unique result out of the interval [0, p_i).
Figure 4 shows the basic idea. At this point it has to be said that a relaxed residuum
at the register level will not work, as both addition and multiplication will be
performed on the results. The idea is based on the same idea as in phase one and uses
the fact that μ_i has only a few bits equal to 1. In addition an evaluation of the two
2^r bits of the redundant number representation becomes necessary. Taking the
constraints on the p_i elements, this results in an X out of the interval [0, 2^r). At
this point carry evaluation has to be performed. Since p_i + μ_i = 2^r, a simple
decision gives the result X or X + μ_i − 2^r as the final result at register level.

Figure 4: The second phase of a_i·b_i + c_i MOD p_i.


4  The  generation  algorithm 

This section focuses on key generation within realistic ranges. As seen from the
constraints (a), (b), (c) of figure 1, put mainly on the values N, M and D, it is not
possible to fulfil them for the arbitrary case. The presented assumption and the
proposed algorithm allow only a restricted subset. D is key sensible; N and M
are only relevant for the processing phase. However, for realistic relations between
the register length r of the individual RNS registers and the total length of the key,
keys exist in a satisfactory variety.

For a given N and M these constraints can be rewritten as:

In the special situation, log2(D) is assumed to range from 512 to 1024 and r is
assumed to be equal to 42. This value results from the fact that at the register
level a Wallace tree [11] is implemented. Optimal register lengths for the Wallace
tree would be 28, 42, 63 etc.

For further considerations detailed calculations are given for log2(D)=672 and r=42.
In this case 32 suitable relative primes q_i are needed. As sketched above, a
number p is quoted to be suitable if it is relatively prime to the other base
elements and the difference 2^r − p has only few ones. Practical tests show that many
more possible values for the p_i, q_i would exist. But the restrictions for the
register oriented arithmetic show that all these values are very close to 2^r.
This means for practical estimations that the existence of many such values does
not help to enlarge the interval from which D can be chosen. The previous
statements are not true for p_1, which is chosen differently. This results in the
estimation that 4N ≈ M. The relations (e) and (f) show the restrictions on base
elements:


(e) N = ∏ p_i,  p_i = 2^r − μ_i

(f) M = ∏ q_i,  q_i = 2^r − λ_i

where μ_i and λ_i have only few bits equal to 1.

As already pointed out, all p and q values are extremely close to 2^r. This means that
the following estimations can be used: M is a power of two minus a comparatively
small Δ_M, and N is a quarter of that power of two minus a comparatively small Δ_N.

With RSA, D is a product of two large primes, D = prime1 * prime2. The restriction
(e) on D allows the following procedure to select keys:

(1) Determine the lower bound D_min of the admissible interval.
(2) Determine the upper bound D_max = N / (1+ε_D).
(3) Select prime1 below a suitable bound.
(4) Select prime2 such that D is within the interval [D_min, D_max].
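
The selection procedure can be illustrated by the following minimal
sketch. The bounds D_min and D_max are passed in as plain numbers, since
their derivation from N and M is not reproduced here; the toy interval,
the 16 bit prime size, the minimum window width and all function names
are assumptions made for illustration. The primality test is the
standard probabilistic Miller-Rabin test, which is not taken from the
paper.

```python
import random


def is_probable_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % small == 0:
            return n == small
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False          # a is a witness: n is composite
    return True


def random_prime(lo, hi, rng, tries=10000):
    """Draw random odd candidates from [lo, hi] until one is probably prime."""
    for _ in range(tries):
        candidate = rng.randrange(lo, hi + 1) | 1
        if candidate <= hi and is_probable_prime(candidate):
            return candidate
    raise ValueError("no prime found in the requested range")


def select_modulus(d_min, d_max, prime_bits, rng):
    """Pick prime1, then prime2 such that D = prime1 * prime2 is in [d_min, d_max]."""
    while True:
        prime1 = random_prime(1 << (prime_bits - 1), (1 << prime_bits) - 1, rng)
        lo = -(-d_min // prime1)      # ceil(d_min / prime1)
        hi = d_max // prime1          # floor(d_max / prime1)
        if hi - lo < 1000:            # window for prime2 too narrow, retry
            continue
        prime2 = random_prime(lo, hi, rng)
        if prime2 != prime1:
            return prime1, prime2, prime1 * prime2


# Toy interval covering roughly 10% of the range below its upper bound.
D_MIN, D_MAX = 3_000_000_000, 3_300_000_000
prime1, prime2, D = select_modulus(D_MIN, D_MAX, 16, random.Random(1))
```

Once prime1 is fixed, the interval for D translates directly into an
interval for prime2, so no candidate D has to be rejected afterwards.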


A short discussion on the availability of MRSA keys as compared to keys in
conventional RSA is added. The restrictions on the selection of D are seen in (a), (b),
and (c), giving restriction (d). This tells the key finder that for a given number of
bits in N only a limited area of the representable values is available for D. From (d)
it can be seen that this area covers a little less than 10% of the space. Half of this
space can be represented with one bit less so that effectively 18% can be used for
keys. This is also shown in figure 5. The number of bits in N can be easily decreased
by one by cutting one of the RNS registers. However, it should be stated that in the
case of efficient register oriented modulo reduction this does mean the use of
additional hardware or the predefinition of the key space. As this restriction allows
only for twice the amount of the available keys and as there are sufficiently many
keys available, this is not assumed for practical applications.


Figure 5: Key space of MRSA in comparison to RSA.

It must also be mentioned that the size of D adapts to N and not to the block
size of the plain text. As this method assumes the plain text block to hold k bits less
than N, the relative key space as compared to the size of the plain text blocks exceeds
that of conventional RSA. Finally it should be stressed that there is no indication that
this structure of restricting the key space could influence the cryptographic quality of
the system to any extent. As all the constraints on D just limit the interval from which
D may be chosen, this fact does not influence key quality at all as demanded by the
RSA encryption system [13].

Due to the nature of the MRSA algorithm, dkey and ekey need not observe
constraints and can therefore be chosen the same way as usual RSA keys. This
means that one of the two is a randomly chosen number and the second is computed
by application of Berlekamp's algorithm [12] to find the multiplicative inverse with
respect to φ(D).
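
The computation of the second key from the first can be sketched as
follows. The sketch uses the extended Euclidean algorithm, a standard
way of finding a multiplicative inverse, and does not claim to reproduce
the algorithm of [12]; the small primes and the function names are
assumptions made for illustration.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def key_pair(prime1, prime2, ekey):
    """Return (ekey, dkey) with ekey * dkey = 1 modulo phi(D), D = prime1 * prime2."""
    phi = (prime1 - 1) * (prime2 - 1)
    g, x, _ = extended_gcd(ekey, phi)
    if g != 1:
        raise ValueError("ekey must be relatively prime to phi(D)")
    return ekey, x % phi


# Toy example with D = 61 * 53 = 3233 and phi(D) = 3120.
ekey, dkey = key_pair(61, 53, 17)
```

With these toy values, encrypting a block with ekey and then applying
dkey returns the original block, as in conventional RSA.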
Abstract

The concept of an interface for electronic mail between a LAN (Local
Area Network) within an organisation and an external network
is presented. The interface’s design renders any external intrusion
impossible. The concept is based on a straightforward hardware
solution and has already been validated by a prototype implementation.

1  Introduction 

The use of computer networks in all fields of business, government, and
science is rapidly growing. Information interchange is often a decisive factor
for competition. Nevertheless, owing to numerous cases of intrusion via the
communication networks, organisations are becoming more and more concerned
about security and reluctant to extend their utilisation of external network
communication; some are even discontinuing network communication.

Computer “viruses” represent an enormous security risk. They endanger the
functionality and integrity of computer systems and the permanent integrity
of data by using more and more elaborate ways of manipulation. Apart from
intruding into a system via exchangeable media such as floppy disks, the most
frequent way of entering a computer system is from a computer network. To show
how computer “viruses” gain control over a computer system, we first give an
overview of the different kinds of “viruses” and their ways of intrusion.
Computer “viruses” are non-self-acting parts of computer programs (host
programs) that are copied into memory by starting a host program. There are
different kinds of computer viruses, e.g., “Trojan horses”, “time bombs”,
“worms”, “boot sector viruses”, and “hybrid viruses”. The variety of kinds of
computer “viruses” is expected to grow further in the future. Generally
speaking, a computer “virus” consists of two parts:

•  Program routines that have to look for further programs on the host
system to which the “virus” can copy itself. This is used to increase the
number of the “virus” program’s copies in a system.

•  Program routines that have to execute one or more manipulating
functions. Most of the manipulation is destruction or change of data which
may cause great economic and financial harm.


There  are  two  different  ways  of  “infecting”  a  program.  One  way  is  to  overwrite 
the  host  program  code  that  is  stored  on  a  writable  medium.  Then,  at  least  parts 
of  the  original  program  are  destroyed  and  the  “virus”  may  easily  be  detected  as 
the  application  is  unable  to  run.  The  other  way  of  infection  is  that  the  “virus” 
adds  itself  to  the  body  of  a  program  and  adjusts  the  program’s  entry  point  and 
address  tables. 


[Figure 1 labels: file header, address table, modified loading address,
corrupted program]

Figure 1: Structure of a proper and of corrupted program files

The “virus” code is loaded into memory together with the host program where
it may stay resident and may “contaminate” further programs. A
“contamination” may be detected by an extension of the host program’s length
in the directory. Many viruses keep cross-links to the host program and
execute parts of the host program while simultaneously carrying through their
manipulations in a way not recognisable by the user.

In the next section we begin with an overview of computer “viruses”,
outlining the ways of “infecting” computer systems and the different classes of
“viruses” which can be distinguished. Next, the existing concepts to protect
computer systems against manipulation by “viruses” are discussed. Most of
them are software-based, but there are also a few hardware-orientated
concepts. After a comparison of the von Neumann computer architecture with the
Harvard architecture, a hardware-based “virus-resistant” network interface is
introduced and details of the prototype implementation are presented.


[Figure 1 labels, continued: modified file header, modified address table,
original loading address, original program code, loading address, “virus”
program code, jump to original loading address; file header, address table,
loading address, program code]
2  Kinds of Computer “Viruses”

Computer “worms” can often be found in multi-user systems and computer
networks, as they keep busy with copying themselves all over the system, which
leads to performance reduction.

“Virus” programs that catch the user’s interest and that start their destruction
immediately after being executed are called “Trojan horses”.

“Time bombs” or “logical bombs” are very similar to “Trojan horses”, but they
start their manipulation when they detect a particular logical or temporal
status. This can be a special time or date. The “Michelangelo Virus” is one of
the “time bombs” as it starts its destruction on Michelangelo’s birthday every
year by reading the date from the system clock [2].

“Boot sector viruses” are placed in the boot sector of a system disk in order
to gain control over a computer system before the proper operating
system is loaded into memory. They can hardly be detected as they can keep
control over all protective devices and mechanisms that are offered by the
operating system and can bypass protection measures.

The group of “hybrid viruses” is the latest variety of “viruses”. They are
resident in memory (like “boot sector viruses”) and, moreover, they exist in
files of computer systems. If only the “infected” files are removed from the
system, the memory resident part of a “hybrid virus” is still active and may
start infecting files anew.


3  Existing  Concepts  of  “Virus”  Prevention 

3.1  Software>Based  Solutions 

As  already  mentioned,  most  of  the  protection  measures  are  software-based. 
They  can  be  classified  into: 

•  Memory-resident  “watchers”,  i.e.,  programs  that  guide  (operating)  sy¬ 
stem  functions  and  interrupts.  They  try  to  secure  a  system  of  being 
corrupted. 

•  Secondary  programs  and  utilities  that  have  to  be  executed  regularly  to 
scan  a  system  and  its  storage  devices  (e.g.,  floppy  disks,  hard  disks  etc.) 
and  search  for  “infected”  data,  especially  “infected”  executable  files. 

The  purpose  of  memory  resident  utility  programs  is  to  control  a  system,  espe¬ 
cially  its  input  and  output  (I/O)  functions,  as  computer  “viruses”  have  to  read 
and  write  back  the  manipulated  data.  These  utilities  have  to  give  a  message 
if  there  is  a  possible  misuse  of  I/O.  The  problem  is  that  the  permanent  func¬ 
tion  control  leads  to  a  p>erformance  reduction  or  that  the  controlling  program 
cannot  decide  on  its  own  whether  an  I/O  request  is  intended  or  caused  by  a 
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“virus”  manipulation. 

The second group of software solutions is characterised by scanning systems
for manipulated data; its members are, therefore, called “virus scanners”.
Corrupted programs can be detected by sequences of characteristic bytes or
string patterns that are known for each individual computer “virus”. A
scanner’s quality depends on the number of known viruses that it can search
for, and this reveals its disadvantage: as the number of newly developed
“viruses” is growing, the scanner has to be updated continually. This makes it
clear that a scanner can only react to known ways of manipulation and cannot
be regarded as a preventive measure.
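
The pattern matching performed by a scanner can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the table
SIGNATURES, the function scan_bytes and the byte patterns are hypothetical and
do not correspond to any real virus or scanner product.

```python
from typing import Dict, List

# Hypothetical signature table: name -> characteristic byte sequence.
# The patterns are invented for illustration only.
SIGNATURES: Dict[str, bytes] = {
    "demo-virus-a": bytes.fromhex("deadbeef01"),
    "demo-virus-b": b"\x90\x90\xeb\xfe",
}


def scan_bytes(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


if __name__ == "__main__":
    clean = b"\x00\x01\x02 ordinary program code"
    infected = clean + bytes.fromhex("deadbeef01")
    print(scan_bytes(clean))     # nothing known is found
    print(scan_bytes(infected))  # the matching signature name is reported
```

A manipulation whose pattern is missing from the table passes unnoticed, which
is exactly the limitation stated above.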

3.2  Hardware-Based  Solutions 

Some companies supply additional hardware devices in order to make computer
systems secure. This starts with simple mechanical barriers to lock floppy
disk drives and to prevent new, possibly corrupted software from entering a
system via exchangeable floppy disks. Furthermore, there are extra boards and
smart cards equipped with cryptographic devices. They are used to encrypt and
decrypt data with user-dependent keys (e.g., the user identification, uid, in
computer networks) to ensure that only authorised users can get access to
security sensitive data. These boards are individually designed and
manufactured. Their use increases the cost and the administrative effort of
computer support and maintenance.


4  Computer  Architectures 

4.1  The  von  Neumann  Architecture 

The classical von Neumann computer architecture makes it really easy for
computer “viruses” to “infect” other programs and to take control over a
computer system after a manipulated host program is loaded into memory, since
program code and data are stored in a common random access memory (RAM), and
since any word in memory can be fetched and executed as an instruction.
Abstractly speaking, this architecture consists of the central processing unit
(CPU) with only a few internal memory cells (registers), external memory cells
in the form of RAM, and external, peripheral devices. RAM is used to store
program code that is successively loaded into the CPU’s instruction register.
The communication between the CPU and its RAM is carried out via buses. The
von Neumann architecture comprises an address bus and a data bus. There is
only a separation between addresses and data. Whether a binary word in memory
is an instruction, a constant, or a data word can, at runtime, only be
inferred from the context. In addition to this, modern operating systems do
not offer any support to supervise the instruction fetch cycle of the CPU.
Moreover, the external storage media, from which programs are loaded into main
memory, can generally be rewritten as well. This means that a proper program,
once stored into memory, cannot protect itself against corruption.


Figure  2:  The  von  Neumann  architecture  (diagram  labels:  data  bus,  address  bus)


4.2  The  Harvard  Architecture 

The analysis of the reasons enabling the “virus” problem suggests a
straightforward and fully effective solution by hardware, such as the Harvard
computer architecture physically separating program and data memories. The
above von Neumann architecture is extended by two further buses, viz., one
each for instruction addresses and instruction data. This ensures that data
and instructions are accessed via two different non-multiplexed buses [3]. The
instruction bus (address and data bus) only offers physical access to the
instruction register of the CPU, while the data bus exclusively serves the
data and operand registers.

Originally, this design was developed to increase performance: with
independent buses, the instruction fetch cycle and the data fetch cycle can be
executed simultaneously. With regard to security sensitive computer
applications, this architecture represents the better design, because it can
protect program code from corruption.


Figure  3:  The  Harvard  architecture  (diagram  labels:  CODE,  DATA)


5  Concepts  of  Protection 

In order to prevent a “virus” from entering a system in the first place, and
to ensure that (program) files do not become manipulated and changed
afterwards, it is sufficient to implement program memories exclusively as read
only memories (ROMs). The possibility of any kind of manipulation and
modification of binary program code is thus systematically excluded. Naturally,
it has to be made sure that any program code which is placed into ROM is free
of any destructive machine instructions.


The prototype of the network interface, which we have built, is structured
according to the above reasoning. It is based on the Z80 CPU. Though its
architecture corresponds to the von Neumann architecture, it is suited for the
prototype implementation once a small additional feature is added. The Z80 CPU
works with an eight-bit data bus and a sixteen-bit address bus. The latter is
physically realised by the CPU’s sixteen address pins called A0 to A15.


The software required for the purpose of a network interface, i.e., operating
system kernel, user interface, editor, and the KERMIT protocol, is provided in
a ROM module occupying the lower 32 kB of the CPU’s 64 kB physical address
space. The upper 32 kB are used for data at runtime. To emulate the Harvard
architecture with the Z80 we took advantage of a particular signal called M1,
available at a pin of the CPU chip. M1 indicates the instruction fetch phase
within the instruction execution cycle. We used the disjunction of M1 and the
inverted address line A15 as input at the RESET pin of the Z80 CPU, i.e.,
trying to read an instruction word from the data space of main memory results
in the immediate resetting of the CPU. This is an additional feature that also
copes with errors in the application software.
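
The reset condition can be modelled as a small truth function. The sketch
below assumes the Z80 convention that the M1 and RESET pins are active-low (a
logic 0 means "asserted"); the function and constant names are illustrative
and are not taken from the prototype.

```python
DATA_SPACE_START = 0x8000  # upper 32 kB (A15 = 1) hold data, as stated in the text


def reset_pin(m1_n: int, address: int) -> int:
    """Level at the active-low RESET pin, computed as M1 OR (NOT A15).

    m1_n is the level of the active-low M1 pin (0 during an instruction fetch).
    The result is 0 (reset asserted) only when an instruction fetch
    (m1_n == 0) targets the data space (A15 == 1).
    """
    a15 = (address >> 15) & 1
    return m1_n | (a15 ^ 1)


def fetch_allowed(address: int) -> bool:
    """True if an instruction fetch from this address does not reset the CPU."""
    return reset_pin(0, address) == 1


if __name__ == "__main__":
    print(fetch_allowed(0x0100))             # fetch from ROM is allowed
    print(fetch_allowed(DATA_SPACE_START))   # fetch from the data space resets
```

Ordinary data reads and writes in the upper half leave the reset line
inactive, because M1 is not asserted during those cycles.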


Figure  4:  Additional  circuitry  in  the  prototype  implementation 


Usually, private networks are directly and physically connected to public ones
and allow for access from outside (remote login, ftp, telnet etc.). This bears
the risk that “contaminations” can enter from outside. In contrast to this, our
interface is designed to serve as a buffer between two communication networks.
Owing to the alternate switches shown in Figure 5, it can only communicate
with one network at a time, making direct links between internal and external
networks impossible. Furthermore, any communication is initiated and actively
carried through by the interface. Mail data are fetched in file transfer mode
from one of the two networks. They are then stored in the interface on disk and
may be manipulated with the editor. Only after the alternate switch has been
turned can the data be forwarded to the other network.

Figure  5:  Alternate  switch  to  avoid  a  direct  link  in  the  communication  line

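The store-and-forward behaviour enforced by the alternate switch can be
illustrated with a small Python model. It is a sketch under assumptions: the
class BufferInterface, its method names and the network labels "external" and
"internal" are hypothetical and only mirror the description above.

```python
from typing import Dict, List


class BufferInterface:
    """Store-and-forward buffer that is connected to one network at a time."""

    def __init__(self) -> None:
        self.connected = "external"   # position of the alternate switch
        self.disk: List[str] = []     # mail stored inside the interface
        self.delivered: Dict[str, List[str]] = {"external": [], "internal": []}

    def turn_switch(self) -> None:
        """Flip the alternate switch to the other network."""
        self.connected = "internal" if self.connected == "external" else "external"

    def _require(self, network: str) -> None:
        if network != self.connected:
            raise ConnectionError(f"not connected to the {network} network")

    def fetch(self, network: str, mail: List[str]) -> None:
        """Fetch mail from a network; only possible while connected to it."""
        self._require(network)
        self.disk.extend(mail)

    def forward(self, network: str) -> None:
        """Forward all stored mail to a network; only possible while connected."""
        self._require(network)
        self.delivered[network].extend(self.disk)
        self.disk.clear()
```

Because a single switch position is stored, no state of this model has both
networks reachable at once.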

6  Conclusion 

Although based on the classical von Neumann architecture, with just a few
technical modifications, the presented prototype implements a fully effective
protection against “virus” intrusion. This is achieved by the following
features:

• Program code is unchangeably stored in read only memory (ROM).

• An additional circuit is used to reset the CPU in case a program tries to
fetch an instruction from RAM.

• A direct link between the local network and external devices is made
impossible by means of an alternate switch box.

In conclusion, it was constructively shown that the problem of “viruses”
entering private networks from public networks can easily be solved by hardware
at negligibly low cost in a fully effective way. Thus, the statement “Avoidance
of viruses seems to be impossible to attain by technical means since they can
be introduced into the system by a properly authorised user” found in [1] has
been refuted.


Abstract

CIP is a formal method for the development of distributed reactive
systems. The compositional, real world oriented approach guides the
developer from an initial environment modelling step towards the
complete definition of the reactive behaviour of a system. The description
technique of the method combines graphical and textual notations.

1  Introduction 

Although it is well known that the use of formal methods supports the
development of robust and reliable systems, there is still a great reluctance
to apply them in practice. A main reason for the poor acceptance of formal
methods is the lack of support through constructive development concepts that
guide the user from the informal requirement description to the definition of
the system [e.g. 3, 6, 11, 12, 13].

D. Harel and A. Pnueli have characterized reactive systems (process control,
embedded and real-time systems) as follows: "A reactive system, in general,
does not compute or perform a function, but it is supposed to maintain a
certain ongoing relationship with its environment [10]." Many formal methods
propose stepwise refinement techniques. Concepts of modularity serve as
guidelines for the difficult task of structuring. Top-down approaches are
suited for the development of transformational systems (functions,
algorithms). The structure of a reactive system, however, must reflect the
temporal behaviour of its environment; thus an outside-in approach is more
adequate. A developed reactive structure may then serve as a basis for the
specification of functional computation. A similar approach (applicative state
transition systems) was proposed as early as 1978 by Backus [1] for purely
transformational systems.

CIP bases the development of a system on an explicit model of its environment.
The real world modelling approach to system development is well known from the
JSD method (Jackson System Development) [4]. CIP differs from JSD mainly in the
state oriented view of processes, and by the possibility to specify
instantaneous interaction between synchronously cooperating system components.

2  The  CIP  Method 

2.1  System  Description  Concepts 

Operational  Specification 

The operational approach to software engineering [14] has been proposed as an
alternative to the conventional approach (Analysis, Design, Coding). An
operational specification is a problem oriented system description which is
executable by a suitable interpreter.

Instantaneous  Reactivity

A system description which allows the specification of system reactions
dependent on the actual state of the environment must abstract from the
non-zero duration of response times. Instantaneous reactivity is therefore the
commonly accepted hypothesis for the development of reactive systems
[2, 5, 11]. The assumption states that a system responds instantaneously to
its inputs. A consumed event and the generated actions thus compose a
temporally atomic system reaction.

Sequential  Components 

We define a system, in contrast to process algebras or Petri nets, as a static
composition of sequential components. Sequential components cooperate
synchronously or asynchronously. Synchronous cooperation is useful for the
description of instantaneous system reactions. Distributed environments,
however, require a formalism including asynchronous cooperation.

System description based on statically composed sequential components is
suitable for the development of safety-critical systems. Due to the time
independent state space structure, such systems are much easier to treat than
specifications based on more general models of concurrency (non-conservative
systems). Furthermore, the constant degree of concurrency considerably
simplifies the verification of the instantaneous reactivity assumption for
implemented systems.

2.2  Development  Steps:  Real  World  Model,  Correspondence  and  Composition

The  general  task  is  always  related  to  an  environment  composed  of  real  objects: 
machines,  chemical  processes,  lanes  of  traffic  at  an  intersection,  etc.,  but  the 
environment  may  also  include  previously-developed  software  components.  The 
developer  is  asked  to  construct  a  specific  software  interaction  between  these  objects. 
CIP  differs  from  most  other  methods  in  that  its  development  process  starts  with  an 
environment  model  and  postpones  the  functional  description  to  a  second  phase.

In  the  first  phase,  the  environment  objects  are  modeled  in  terms  of  state  machines. 
For  every  modeled  object  a  corresponding  model  process  is  specified.  These 
provisionally  incomplete  system  components  define  in  fact  an  interaction  protocol 
for  the  system  and  its  environment.  In  an  implemented  system  the  corresponding
synchronisation  takes  place  by  means  of  transmitted  events  and  actions. 

The second phase of development involves the introduction of function
processes, allowing for instantaneous interaction and asynchronous
communication to take place between the established components. Here, the
application of classical concepts of modularity strategically supports the
composition process. The result is a network of interacting and communicating
extended state machines that provide an operational description of the system.

Specifications based on real world models are transparent and easy to
understand because all elements are clearly related to the environment. In a
running system, the current states of the model processes always correspond to
the current states of the associated real objects. This is an important
prerequisite for the development of robust and safe systems. The approach also
has advantages as regards system maintenance, since a real world model is
likely to be more robust than a set of functional requirements.


3  CIP  System  Description

A CIP specification provides an operational description of how a system reacts
to external events. The reactive behaviour is described by extended state
machines that can influence each other. Data processing and algorithmic
functions are carried out in the transitions of the state machines. Transition
structures and data flow networks are graphically specified while functions
and conditions are defined through annotations in a functional language.

Systems 

A  CIP  System  consists  of  several  concurrent  clusters.  A  cluster  is  a  sequential 
subsystem  composed  of  a  set  of  synchronously  cooperating  processes.  A  state 
transition  of  a  cluster  represents  an  instantaneous  system  transition,  defined  through 
the  state  transitions  of  its  components.  The  processes  of  a  cluster  interact  through 
instantaneous  transmission  of  pulses  (software  events).  All  clusters  may  contain 
processes  which  communicate  through  asynchronous  exchange  of  messages.

Processes

A process is an extended state machine which can carry out internal operations
through its transitions. We distinguish interaction driven processes
(I-process) and communication driven processes (C-process). The state
transitions of an I-process are triggered by external events and by pulses
emitted by other processes. Occurring events must always be accepted while
occurring pulses may be ignored. A single output of an I-process can consist
of a pulse for other I-processes of the cluster and of an action for the
system environment. Messages for other processes can be transmitted through
specified outports. A transition of a C-process occurs spontaneously when a
message is pending at one of its inports. The output of a C-process may also
consist of a pulse and of several messages.

The transition graph of a process may contain non-deterministic branchings,
i.e. in a given state, several transitions are possible for the same input. In
order to resolve this ambiguity, switch conditions must be defined that can
depend on the states and the variables of the processes within the same
cluster. Compared to pulse transmission, state inspection represents a much
weaker coupling, because the inspected processes are not affected.

A process' local memory can be extended with variables. In order to support
data transmission, corresponding record types are declared for each event,
action, pulse and message. Transitions are annotated by sequences of
operations, which are ideally described in a functional language.

Moderated  Processes 

A moderated process consists of several alternative modes. The modes of a
process differ in their transition graphs. The state space and the interface,
however, are the same for all modes of a given process. The dynamics of mode
transitions is defined through a further transition structure called the
moderator, which is built on the modes themselves. A mode transition may
trigger a state transition of the new mode.

Instantaneous  Interaction

The static pulse flow structure of a cluster is specified through an
interaction net. Through pulse transmission every process which expects
external input (events, messages) can cause instantaneous chain reactions. In
order to prevent cyclic chains, a cascade must be specified for every process
with external input. A cascade is a cycle-free, partially ordered subnet of the
interaction net which defines the possible paths of pulse transmission caused
by its top process. The transitions of the processes activated in a cascade
define a temporally atomic cluster transition.
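
The cycle-freeness required of a cascade can be checked mechanically. The
following Python sketch is not CIP notation and not the CIP tool: it treats a
pulse-flow structure as a plain directed graph. The process names are borrowed
from the case study in section 4, but the pulse edges listed here are
illustrative assumptions, not the net of Fig. 3.

```python
from typing import Dict, List

PulseNet = Dict[str, List[str]]  # sender process -> receivers of its pulses

# Assumed example data: an interaction net with a cycle, and a cascade for
# the top process LightBarrier that selects a cycle-free subnet of it.
INTERACTION_NET: PulseNet = {
    "LightBarrier": ["Door"],
    "Door": ["Controller"],
    "Controller": ["Door", "Conveyor"],
}
LIGHTBARRIER_CASCADE: PulseNet = {
    "LightBarrier": ["Door"],
    "Door": ["Controller"],
    "Controller": ["Conveyor"],
}


def is_cycle_free(net: PulseNet) -> bool:
    """Depth-first search for a cycle in a pulse-flow graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: Dict[str, int] = {}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for succ in net.get(node, []):
            state = colour.get(succ, WHITE)
            if state == GREY:  # back edge: a pulse chain could loop forever
                return False
            if state == WHITE and not visit(succ):
                return False
        colour[node] = BLACK
        return True

    return all(visit(n) for n in list(net) if colour.get(n, WHITE) == WHITE)


def reachable(cascade: PulseNet, top: str) -> List[str]:
    """Processes that a pulse chain started by the top process may activate."""
    seen: List[str] = []
    stack = [top]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(reversed(cascade.get(node, [])))
    return seen
```

In this reading, the processes returned by reachable for a top process are
those whose transitions may together form one atomic cluster transition.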

Asynchronous  Communication

The message flow in a system is defined through networks of processes linked
by datastreams. A datastream is a FIFO buffer which stores the messages
arriving from the connected process outports. When the connected reader process
awaits a new message, the oldest message is released to the corresponding
inport. External datastreams are connected to devices.
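
A datastream as described here behaves like an ordinary queue. The following
minimal Python sketch illustrates this; the class name Datastream and its
methods write and release are assumptions made for the example, not part of
the CIP language.

```python
from collections import deque
from typing import Deque, Optional


class Datastream:
    """FIFO buffer between writer outports and one reader inport."""

    def __init__(self) -> None:
        self._buffer: Deque[object] = deque()

    def write(self, message: object) -> None:
        """Store a message arriving from a connected outport."""
        self._buffer.append(message)

    def release(self) -> Optional[object]:
        """Hand the oldest pending message to the reader, or None if empty."""
        return self._buffer.popleft() if self._buffer else None

    def pending(self) -> int:
        """Number of messages waiting for the reader process."""
        return len(self._buffer)
```

Writers never wait for the reader in this model, which is what makes the
cooperation between clusters asynchronous.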

Denotational  Semantics 

The  meaning  of  language  constructs  describing  synchronous  cooperation  has  been 
defined  through  SCSM-expressions  [8].  SCSM  is  a  formalism  for  the  description  of 
synchronous  compositions  of  state  machines  [7].  The  asynchronously  cooperating 
clusters  are  interpreted  as  Petri  net  state  machines  coupled  to  Petri  nets  which  model
the  behaviour  of  the  datastreams  [9]. 

Extension  of  the  Formalism 

The implicitly defined temporal properties of a system can be verified by
automated system execution. We are working on an extension of CIP by a logical
language for the description of temporal properties of the controlled
environment. The verification of temporal predicates can be based on the common
environment model.


4  A  Complete  Case  Study 

We  specify  a  simple  system  which  can  be  described  by  pure  state  machines  only.

4.1  Requirement  Description  of  the  AccessControlSystem

Fig.  1  System  Environment

An  arriving  visitor  must  insert  his  badge  into  the  scanner  in  order  to  be  identified. 
For  an  accepted  visitor  the  door  is  automatically  opened  and  the  possibly  idle 
conveyor  is  started.  The  opened  door  closes  as  soon  as  the  light  barrier  gets  free.  The 
closing  door  reopens  when  the  light  barrier  is  interrupted  again.  The  scanner  is  only
enabled  when  the  door  is  fully  closed. 

Entries are signalled to the reception desk inside the building where entered
visitors have to register. If no more visitors are on the way to the reception
desk, the conveyor of the entrance is turned off. At most three visitors are
allowed to be on the way for registering.

A switch allows the access control system to be enabled or disabled. If the
system is disabled, an already accepted visitor is still allowed to enter.
However, when the light barrier is interrupted, the possibly closing door is
stopped without reopening. At any time the system can be enabled to work again
in its normal mode.
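
The door behaviour required in normal mode can be read as a small state
machine. The Python sketch below is a plain transition table, not a CIP
specification; since the model processes of Fig. 2 are not reproduced in the
text, all state and event names are assumptions chosen for illustration, and
the disabled mode is left out.

```python
from typing import Dict, Iterable, Tuple

# (state, event) -> next state, normal mode only. Names are hypothetical.
DOOR_TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("closed", "accepted"): "opening",             # accepted visitor opens the door
    ("opening", "fully_open"): "open",
    ("open", "barrier_free"): "closing",           # closes once the barrier is free
    ("closing", "barrier_interrupted"): "opening", # reopens on a new interruption
    ("closing", "fully_closed"): "closed",
}


def door_step(state: str, event: str) -> str:
    """Next door state; events without a transition leave the state unchanged."""
    return DOOR_TRANSITIONS.get((state, event), state)


def scanner_enabled(door_state: str) -> bool:
    """The scanner is only enabled when the door is fully closed."""
    return door_state == "closed"


def run(events: Iterable[str], state: str = "closed") -> str:
    """Feed a sequence of events to the door and return the final state."""
    for event in events:
        state = door_step(state, event)
    return state
```

Ignoring events that have no transition is a simplification of this sketch;
it is not meant to reflect how CIP treats unexpected events.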

REMARK: We suppose that every accepted visitor really enters the door.


4.2  AccessControlSystem  First  Phase:  Model  Processes

The provisionally incomplete model processes behave like their corresponding
real world objects of the system environment. They are going to be completed in
the second development phase (section 4.3), where asynchronous communication
and instantaneous interaction with additional function processes are
introduced.

Fig.  2  Incomplete  Model  Processes


4.3  AccessControlSystem  Second  Phase:  Functions 

Comments on the graphical CIP specification below follow in section 4.4.

COMMUNICATION  NET  OF  SYSTEM  AccessControlSystem

Fig.  3  Interaction  and  Communication  Networks


4.4  Comments  on  the  specified  AccessControlSystem

COMMUNICATION  NET 

The  system  consists  of  the  two  concurrent  clusters  Entrance  and  Reception.

The cluster Entrance contains all model processes, the interaction driven
process Controller and the communication driven process Supervisor. The
Controller informs the Supervisor about entered visitors, while the Supervisor
is continuously aware of the number of actual entries.

The  cluster  Reception  consists  of  the  communication  process  Desk  only.  Desk
prompts  for  information  from  the  reception  desk  (devices)  in  order  to  inform  the 
Supervisor  about  arrived  visitors. 

INTERACTION  NET  OF  CLUSTER  Entrance 

The main function of the Controller consists of activating and deactivating
the Scanner, the Door and the Conveyor. The control depends on the behaviour
of the Scanner and the Door, which in turn signal occurring events to the
Controller through correspondingly transmitted pulses. The control of the
Conveyor depends on the number of visitors currently entering. The Controller
obtains the necessary information through pulses stemming from the Supervisor.

The influence of the LightBarrier on the Door is realized through direct
interaction between these two model processes. For every occurring event of
the LightBarrier a corresponding pulse is sent to the Door.

Enabling  and  disabling  of  the  system  through  the  Switch  process  influences  the
behaviour  of  the  Controller  and  the  Door. 

CASCADES  OF  CLUSTER  Entrance

The cascades represent partially ordered and cycle-free subnets of the
interaction net. Every cascade defines the possible pulse flow caused by an
occurring event or a consumed message of its top process.

PROCESSES 

The Controller and the Door are specified as interactively moderated
processes. The state transition structures of their shutting and normal modes
describe the corresponding alternative dynamical behaviour. The moderators of
these processes are stimulated by the shut and access pulses of the Switch
process. Through emitted triggers (shutTrg, accsTrg, continue), a mode
transition can activate a state transition of the new mode.

The non-deterministic branching (opening, Open) of the normal mode of the Door
process is resolved through an associated switch, which inspects the current
state of the LightBarrier.

The non-deterministic branching (enabled, Done) of the Scanner process is not
resolved within this specification. A corresponding specification based on
extended state machines would associate a switch depending on the record data
transmitted by the scanner event Done.

REMARK 

For expository reasons we identified sent and received pulses and messages by
their name. In order to support modularity, the CIP system description language
associates outputs to inputs through explicit translation functions.
Furthermore, abstraction from the state inspection mechanism is obtained
through the encapsulation of states and variables of inspected processes by
specific state vector access procedures.

CIP  Tool

A graphical specification tool has been developed which automatically tests
recorded specifications for consistency. A code generator provides modules,
which can be animated in the source code environment, or which can be ported
to a target computer. In order to complete an implementation it suffices to
write I/O drivers for the transfer of physical events into logical ones and of
logical actions into physical ones. In some instances communication links with
the environment have to be created.


Abstract

The  objective  of  this  paper  is  to  give  some  reflections  about  handling 
of  exceptions  in  hard  real-time  environments,  which  is  among  the  less 
elaborated  topics  in  this  domain. 

A classification of possible exceptions in real-time systems is given, to
identify the ones which can be prevented by certain design measures or
avoided by specifying and servicing them within their contexts. A way
to survive the remaining ones in a well-structured and predictable way,
and as painlessly as possible, is proposed.


1  Introduction 

In his reference paper [14], Stankovic unmasks several misconceptions in
the domain of hard real-time systems. It seems that the most characteristic
one is that real-time computing is equal to fast computing. It is obvious that
computer speed itself cannot guarantee that the specified timing requirements
will be met.

Instead, a different ultimate objective was set: predictability of temporal
behaviour. Being able to assure that a process will be served within a
predefined time frame is of utmost importance. In multiprogramming environments
this condition can be expressed as schedulability: the ability to find a
schedule such that each task will meet its deadline [16].
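
As a deliberately simple instance of a schedulability condition, the Python
sketch below applies the classical utilisation test for preemptive
earliest-deadline-first scheduling of independent periodic tasks on one
processor, with deadlines equal to periods. The choice of this test is an
assumption made for illustration only; the text does not prescribe a
scheduling policy, and the task parameters are invented.

```python
from fractions import Fraction
from typing import List, Tuple

Task = Tuple[int, int]  # (worst-case execution time, period), same time unit


def utilisation(tasks: List[Task]) -> Fraction:
    """Total processor utilisation: the sum of execution time / period."""
    return sum((Fraction(c, t) for c, t in tasks), Fraction(0))


def edf_schedulable(tasks: List[Task]) -> bool:
    """Under the stated assumptions, a schedule exists iff utilisation <= 1."""
    return utilisation(tasks) <= 1


if __name__ == "__main__":
    demo = [(1, 4), (2, 5), (1, 10)]  # invented task set
    print(utilisation(demo), edf_schedulable(demo))
```

The test takes worst-case execution times as given, so it is only as
trustworthy as the execution time bounds supplied to it.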

For schedulability analysis, execution times of tasks must be known in advance.
These, however, can only be determined if the system functions predictably.
To assure overall predictability, all levels of system design must be
predictable in the temporal sense, from the processor to the system
architecture, language, operating system, and exception handling
(layer-by-layer predictability, [15]).

In recent years, the domain of real-time systems has gained substantial
research interest. Certain sub-domains have been examined very thoroughly, such
as scheduling and analysis of program execution times. It is typical that most
of the research done was dedicated to the higher level topics and presumes that
the underlying ones are fully predictable.

Exception handling is one of the most severe issues to be solved when a system
is to behave predictably. By an exception we mean any intrusion into the normal
program flow which cannot be considered in schedulability analysis; it is
usually related to residual specification and implementation errors, and to
failures. Anticipated timing events and events from the environment which
trigger associated processes do not belong to this category. They should be
implemented in a way which does not cause any non-deterministic delays in the
execution of the running task. That can be done by migrating event recognition
and operating system services out of the main task processor, as is done in
the Spring project [13], or proposed in [3].

When an exception occurs in a program, the latter is inevitably delayed,
causing a serious problem with respect to the a priori determined execution
time. Therefore, exceptions should be prevented by all means, whenever and
wherever possible [1]. If it is not possible to prevent them from happening,
they should be handled in a consistent and safe way in conformity with the
hard real-time systems design guidelines. The need for a consistent solution to
the exception problem is heightened by the fact that exceptions are often a
result of some critical state of the system, which is when the computer control
aid is needed most.


2  Classification  of  Exceptions 

In this section we attempt to identify the exceptions appearing in hard
real-time environments. For that reason we classify them in two ways: with
regard (a) to their origin and (b) to whether they can be prevented or not.

2.1  Origins  of  Exceptions  in  Hard  Real-Time  Environments

Screening possible run-time errors in various programming environments and
relying on our experience in real-time programming, we established the
following classification of exceptions according to their origin:

a)  Exceptions  caused  by  I/O  operations 

•  I/O  device  errors 

•  Invalid  device  addressing  (invalid  unit  identification,  no  such  device)

•  Exceptions  caused  by  the  file  management  (where  provided) 

b)  Exceptions  caused  by  invalid  data 

• Traps and errors concerning irregular results of operations (overflow,
underflow, undefined)


•  Arithmetic  functions  with  illegal  run-time  parameter  values  (square  root, 
logarithms  etc.) 

•  Format  declaration/run-time  value  mismatch  or  conversion  errors  in  I/O 
operations 

•  Subscript  (array  or  string  index)  out  of  range 

•  Invalid  procedure  parameter  numbers  or  types 

c)  Errors  preventable  by  imposing  restrictions 

•  Errors  connected  with  dynamic  language  features  (insufficient  memory 
due  to  recursion  or  pointers,  dynamic  formatting,  dynamic  function  calls 
etc.) 

•  Problems  concerning  virtual  addressing 

d)  Problems  in  tasking 

•  conflict  situations  like  terminating,  suspending  or  resuming  a  non-existent 
task  etc. 

e)  System  exceptions 

•  diagnostics,  hardware  and  system  alerts  due  to  unit  failures  (e.g.  bus 
error  etc.) 

In  the  sequel  we  classify  exceptions  according  to  the  criterion  whether  they  can 
be  prevented  or  not. 

2.2  Preventable  Exceptions 

Some exceptions can be prevented by restricting the use of potentially
dangerous features. Compliance with these restrictions must be checked by the
compiler.

For example, only sequential file organisation and compile-time known file
names and other parameters may be used. No dynamic features like recursion,
references, virtual addressing etc. are allowed.

Another  means  to  prevent  exceptions  is  to  implement  strict  type  checking  in  the 
language,  so  that  possible  irregular  operations  can  be  reported  at  the  compile 
time  (as  an  example,  see  [6],  supported  by  corresponding  hardware  [8]). 

Since strict type checking seems impractical, we suggest extending the input
and output data types by two “irregular” values: “signed infinity”, to
accommodate overflows and underflows, and “undefined” (a solution with
“holes” in the domain and “bumps” in the range was already implemented
in CLU [11]). The “undefined” value is used when a non-recoverable problem
has occurred in a calculation or during an I/O process, rendering the results
meaningless. A similar principle is followed in the IEEE 32-bit floating point
standard [2] and implemented in the MC68881 co-processor [12], with a quiet
or signalling “not-a-number (NaN)”.

Thus,  generated  irregular  values  do  not  raise  exceptions,  but  are  propagated  to 
the  subsequent  or  higher-level  blocks,  which  may  be  able  to  handle  them  (see 
the  example  of  an  implementation  below).  Any  operation  on  irregular  operands 
always  yields  a  result  of  irregular  type. 
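
The propagation rule can be illustrated with a minimal Python sketch. It is not the implementation the authors refer to; the names `Irregular` and `guarded` are hypothetical, and two simplifications are assumed: the first irregular operand is passed through unchanged, and an overflow whose sign cannot be determined is reported as positive infinity.

```python
import math
import operator
from enum import Enum


class Irregular(Enum):
    """Hypothetical markers for the irregular values described above."""
    POS_INF = "+infinity"
    NEG_INF = "-infinity"
    UNDEFINED = "undefined"


def guarded(op, *args):
    """Apply op without ever raising; irregular operands propagate unchanged."""
    for arg in args:
        if isinstance(arg, Irregular):
            # Simplification: the first irregular operand becomes the result.
            return arg
    try:
        result = op(*args)
    except OverflowError:
        # Simplification: the sign of the overflow is not recovered here.
        return Irregular.POS_INF
    except (ValueError, ZeroDivisionError):
        return Irregular.UNDEFINED
    if isinstance(result, float):
        if math.isnan(result):
            return Irregular.UNDEFINED
        if math.isinf(result):
            return Irregular.POS_INF if result > 0 else Irregular.NEG_INF
    return result


root = guarded(math.sqrt, -1.0)             # illegal argument
shifted = guarded(operator.add, root, 1.0)  # irregular operand propagates
big = guarded(operator.mul, 1e308, 10.0)    # float overflow to infinity
```

No exception is raised at any point; a higher-level block or an output interface can inspect the final value and decide how to react.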

Intelligent I/O interfaces should react in a predefined way if a final result
output to them is irregular. Reactions to different irregular types can
differ. The interfaces may tolerate them, they may be capable of a local graceful
degradation of their performance (if the action to be taken is not vital or can
be pre-programmed for such cases) or, if inevitable, recognise a catastrophic
situation. E.g., if a regulating system requires, as a reaction to a disturbance,
“as intensive a counter-response as possible” (like a D-regulator), an “infinite”
value may be produced, resulting in the maximum possible control signal, which
may depend on the type of actuator implemented.

2.3  Non-Preventable  Exceptions 

In the sequel we further classify the non-preventable exceptions into anticipated
and non-anticipated ones. The former can be avoided; the latter, however, must
be handled in a consistent and safe way. A reference study in the domain of
non-preventable exceptions was done by Cristian [4, 5].

2.3.1  Anticipated  Exceptions 

If  the  potential  danger  of  irregularity  can  be  recognised  during  the  design  time, 
it  has  to  be  taken  care  of  in  the  specifications. 

For example, peripheral devices shall be intelligent, fault-tolerant and
self-checking in order to be able to recognise their own malfunctions and to
insulate the effects of the latter in “watertight” compartments. They shall react
reasonably in conflict situations. When the error is recoverable, they should try
to recover locally using fault-tolerance principles (self-checking, redundancy
etc.).

A number of exceptions resulting from irregular data can be avoided by
prophylactic run-time checks before entering critical operations. Many tasking
errors are also avoidable by first using monadic operations to check the
system state.
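
A short Python sketch of such prophylactic checks follows. The function names, the fallback values, and the dictionary-based task table are assumptions made purely for illustration; the point is only that the condition is tested before the critical operation, so the operation itself can never raise.

```python
import math


def safe_sqrt(x: float, fallback: float) -> float:
    """Check the argument first so that math.sqrt can never raise."""
    return math.sqrt(x) if x >= 0.0 else fallback


def safe_read(samples: list, index: int, fallback: float) -> float:
    """Check the subscript first so that indexing can never raise."""
    return samples[index] if 0 <= index < len(samples) else fallback


def resume_if_present(tasks: dict, name: str) -> bool:
    """Query the task table before resuming (hypothetical tasking model)."""
    if tasks.get(name) != "suspended":
        return False  # non-existent or not suspended: nothing to resume
    tasks[name] = "running"
    return True
```

Each check has a fixed, small cost, so it can be included in the a priori execution time estimate.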

An obvious and frequently used way of avoiding critical failures in hard
real-time systems design is redundancy. Redundant system components must be
implemented according to a thorough analysis of the fault hypothesis. Besides
physical faults in operation, the latter should also include errors in the
design and implementation of hardware and software components. E.g., in
avionics, implementations of redundant systems can be found that are based on
different processors and built by different teams.

An example of a consistent implementation of redundancy is the MARS system
[10]. Components possess self-checking properties and produce either correct
or no results (fail silently); in the latter case the redundant component’s
results are taken. To determine where and to what extent redundancy should be
applied, the Mars Reliability Predictor and Low-Cost Estimator (MARPLE)
was implemented. Programs written in a general-purpose design language for
distributed systems are translated into reliability models, which are then analysed
by the Symbolic Hierarchical Automated Reliability and Performance Evaluator
(SHARPE), and several parameters are produced. Based on these parameters,
dependability can be estimated. However, if a system is extremely safety
critical, the failure of redundant devices must also be taken into account, in
spite of the low probability of such an event.

2.3.2 Non-Anticipated Exceptions

If there is no way to predict an error, the exception caused must be handled
in order to survive it. These are situations when “the impossible happens” [1],
in which programs do not follow their specifications due to hardware failures,
residual software errors or wrong specifications. For example, failure of a part
of memory can result in the change of constant values; an error in file
management or on a disk is usually unexpected. In safety-critical control systems
the non-anticipated exceptions may have catastrophic consequences. There it
is especially important to implement a mechanism for their safe and consistent
handling.

In his early paper, Goodenough [7] presented the idea of assigning default
or programmed exception handlers to every potentially dangerous operation.
According to the severity of the exception raised, the running process was either
terminated or suspended and resumed later. A similar mechanism, although
considerably more elaborate and adapted for use in hard real-time systems, was
implemented in Real-Time Euclid [9]. There, exception handlers were
(optionally) located within block constructs and were executed in case of an
exception. If there were no exceptions, they had no effect except for their
impact on the block’s execution time as estimated by the schedulability analyser,
thus making the block more difficult to schedule. Exceptions may be raised by
kill, terminate or except statements, to terminate a process entirely or only its
frame, or to execute the handler without termination of the process, respectively.

3 Coping with Non-Preventable Exceptions

To handle catastrophes we propose a combination of preconditions,
postconditions and modified recovery blocks implementing both backward and forward
recovery. Its syntax is as follows:

block ::= block-begin block-tail

block-begin ::= BEGIN | PROCEDURE [parameters & attributes;] |
                TASK [parameters & attributes;] | parameters REPEAT

block-tail ::= [declaration-sequence] [alternative-sequence] END;

alternative-sequence ::= {[ALTERNATIVE [PRE bool-exp;] [POST bool-exp;]]
                          [RESTORE] statement-sequence}

A block (task, procedure or other block structure) consists of alternative
sequences of statements. Every alternative can have its own pre- and/or
post-conditions, expressed as Boolean expressions. When program flow enters the
surrounding block, the initial state of the system is stacked if there is at least
one alternative implementing backward recovery, which is denoted by the keyword
RESTORE. Then, the first alternative statement sequence whose pre-condition
(if it exists) is fulfilled is executed. At the end, its post-condition is
checked, and if this is also fulfilled, execution of the block terminates
successfully. If the post-condition is not fulfilled, the next alternative is
checked for its pre-condition and possibly executed. If backward recovery is
requested, the initial state is restored.
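
The control flow described above can be sketched in Python. This is an illustrative model only, not the proposed language construct: the names `Alternative` and `run_block` are hypothetical, the system state is assumed to be a plain dictionary, and restoration is assumed to happen after the pre-condition check and immediately before the statement sequence, following the order of the grammar.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Optional

State = dict


@dataclass
class Alternative:
    body: Callable[[State], None]
    pre: Optional[Callable[[State], bool]] = None
    post: Optional[Callable[[State], bool]] = None
    restore: bool = False  # corresponds to the RESTORE keyword


def run_block(state: State, alternatives: list) -> bool:
    """Return True as soon as one alternative passes its post-condition."""
    # Stack the initial state only if some alternative asks for RESTORE.
    snapshot = None
    if any(alt.restore for alt in alternatives):
        snapshot = copy.deepcopy(state)
    for alt in alternatives:
        if alt.pre is not None and not alt.pre(state):
            continue  # pre-condition not fulfilled: try the next alternative
        if alt.restore:
            state.clear()
            state.update(copy.deepcopy(snapshot))
        alt.body(state)
        if alt.post is None or alt.post(state):
            return True
    return False  # the block failed; an enclosing alternative fails as well


def primary(s: State) -> None:
    s["valve"] = s["demand"] * 2  # deliberately faulty for the example


def fallback(s: State) -> None:
    s["valve"] = min(s["demand"], 100)


def in_range(s: State) -> bool:
    return 0 <= s["valve"] <= 100


state = {"demand": 80, "valve": 0}
ok = run_block(state, [
    Alternative(primary, post=in_range),
    Alternative(fallback, post=in_range, restore=True),
])
```

Here the primary alternative violates its post-condition, so the second alternative runs from the restored initial state and the block succeeds.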

The  alternatives  may  contain  independently  designed  and  coded  programs  to 
comply  with  specifications  and  to  eliminate  possible  implementation  problems 
or  residual  software  errors.  They  may  also  contain  alternative  design  solutions 
or  redundant  resources,  when  problems  are  expected.  A  further  possibility  is  to 
assert  less  restrictive  pre-  and/or  post-conditions  and  to  degrade  performance 
gracefully.  By  the  means  presented  in  [17]  it  is  also  possible  to  bound  the 
execution  times  of  alternatives.  If  one  of  them  fails  to  complete  inside  the 
predefined  period,  a  less  demanding  alternative  is  taken. 

If there is no alternative whose pre- and post-conditions are fulfilled, the block
execution is unsuccessful. If the block is nested inside an alternative on the
next higher level, this alternative fails as well and control is given to the
next one, thus providing a chance to resolve the problem in a different way. On
the highest level, the last alternative must not have any pre- or post-conditions.
It must solve the problem by applying some conventional actions like employing
fault-tolerant measures or performing a smooth power-down. Since the system
is in extreme and unrecoverable catastrophic conditions, different control and
timing policies may be in action, requesting safe termination of the process and
possibly post-mortem diagnostics.

For embedded systems it is important to consider whether backward recovery of
a certain block is possible or not. If an action is triggered in an alternative
block, such as commencing a peripheral process, which causes an irreversible
change of the initial state, that state cannot be restored for another try. In
this case only forward recovery is possible, bringing the system to a certain
predefined, safe, and stable state.

The method inevitably yields a pessimistic execution time estimate, which is the
sum of the execution times of all alternatives together with the pre- and
post-condition evaluation times and the administration overhead. However, this is
not specific to this method. In safety-critical hard real-time systems it is
necessary to consider the worst-case execution time, which must also cover
exceptional conditions. Depending on the performance reserve of the system, more
or fewer alternatives may be implemented, performing more or less degraded
functions. In extremely time-critical systems a single alternative may be
implemented, on the highest-level block only, providing a safe and smooth
power-down.

To cope with that problem some further solutions are possible. Each subsequent
alternative may be bounded to half of the execution time of the
previous one; thus, the block will terminate in at most double the execution time
of the primary alternative. Also, from the failure of an alternative it is
possible to deduce which subsequent alternatives in following blocks are
reasonable and which are not, and to set their pre-conditions accordingly.
However, this requires a sophisticated run-time analyser.
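
The factor of two follows from a geometric series. As a sketch, assume the primary alternative is bounded by a time T (a symbol introduced here for illustration), that m alternatives exist, and that condition evaluation and administration overhead are neglected:

```latex
\sum_{j=0}^{m-1} \frac{T}{2^{j}} \;=\; 2T\left(1 - 2^{-m}\right) \;<\; 2T
```

For instance, with T = 8 ms and m = 4 the alternatives are bounded by 8, 4, 2 and 1 ms, giving 15 ms in total, which is below 2T = 16 ms.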

4  Conclusion 

In  order  to  assure  a  predictable  behaviour  of  real-time  systems,  it  is  necessary 
to  determine  a  priori  bounds  for  the  task  execution  times.  Exception  handling 
represents  the  most  severe  obstacle  to  this  end.  Therefore,  exceptions  were 
investigated  and  classified  with  the  objective  of  obtaining  a  remedy  for  this 
problem.  It  turned  out  that  a  large  group  of  them  can  be  either  prevented  by 
appropriate  measures  or  avoided  at  run-time.  The  others  are  coped  with  in 
a  well-structured  environment  by  providing  sequences  of  gradually  more  and 
more evasive software reactions. In either case, the run-time behaviour of
real-time tasks becomes fully predictable.
Abstract

It is essential that data transmission between systems in a
distributed life-critical application occurs in a safe manner. To date
this transmission has occurred in a parallel form; however, rising
material and labour costs have made this method of safe data
transmission expensive. In this paper the development of a Fail-Safe
Data Transmission System (FSDTS), which can be used in these
distributed applications, is presented.

1  Introduction 

There are distributed life-critical applications that require safe data to be transmitted
over large distances, e.g. railway interlockings. To date, this transmission has usually
occurred in a parallel manner, which has become expensive. The advent of
microprocessors has made it possible to develop cost-effective solutions to overcome
this problem. There is, however, an antipathy to using microprocessor-based systems
in safety-critical applications because they introduce unidentified and often diverse
factors which could result in life-threatening failures.

The aim of this paper is to introduce a FSDTS which was developed incorporating
techniques that ensure safety when these complex devices are used in life- or
safety-critical applications.

1.1  Structure  of  Paper 

The  Fail-Safe  Data  Transmission  Project  (FSDTP)  which  includes  the  development 
of  the  FSDTS,  has  been  broken  up  into  a  number  of  phases.  This  paper  is  divided 
into  five  sections,  which  correspond  to  the  phases  of  the  project  that  have  been 
completed and are currently in progress. A breakdown of the paper is given below:

Section  1  (Introduction)  will  introduce  the  FSDTP  project,  giving  an  overview  of 
the  FSDTS  with  emphasis  on  the  operating  requirements  and  constraints. 
Section 2 (Safe Serial Data Transmission) delineates the method for achieving
safety  in  the  transmission  of  safe-data  over  a  serial  channel  in  the  presence  of  noise. 
Issues  relating  to  error  control  mechanisms  with  emphasis  on  the  code  and 
communication  scheme  selection  will  be  discussed. 

Section  3  (Development  of  a  Fail-Safe  Data  Transceiver)  describes  how  a  Fail-Safe 
Data  Transceiver  (FSDT)  can  be  designed.  Here  hardware  safety  will  be  qualified 
and  the  architecture  of  the  FSDT  will  be  presented. 

Section 4 (Ensuring "Error-Free" Software) will give preliminary results of this
research.  The  aims  of  this  phase  of  the  project  will  be  presented  with  comments  on 
the  future  research  to  be  undertaken. 

Section  5  (Conclusions)  gives  an  overview  of  the  project  to  date,  future  plans  and 
the  results  from  the  FSDTs  being  monitored  in  the  field. 

1.2  FSDTP  Background 

The  FSDTP  was  initiated  to  investigate  a  method  of  overcoming  the  heavy  financial 
burden  in  maintaining  the  existing  cabling  infrastructure  used  to  transmit  safe  data 
between  various  interlockings  within  the  train  movement  control  system  (TMCS). 
Contributing  factors  to  this  were:  cost  of  new  cables,  degradation  of  the  existing 
cables,  cost  of  maintaining  the  existing  cables  and  the  theft  of  cables. 

The aim of the project was to develop a microprocessor-based serial data
transmission system which could be used to transmit this safe information. The system
would initially be used to transmit small amounts of data between interlockings
within the TMCS. After a successful evaluation phase the transceivers would be
used to transmit control information within the interlocking and to the train itself.

1.3 System Overview

The  FSDTS  must  have  the  ability  to  utilise  a  variety  of  serial  channels  and  be 
transparent  to  the  safety-system,  in  the  sense  that  the  safety  systems  must  still 
communicate  in  a  parallel  form. 


Figure 1: FSDTS in Routing Configuration

The  FSDTS  can  either  be  used  in  a  store  and  feed  forward  configuration  shown  in 
Figure  1  above  or  in  a  master  slave  configuration  illustrated  below  in  Figure  2. 

Figure 2: FSDTS in Master-Slave Configuration


1.4  FSDTS  Requirements  and  Constraints 

To offer the same safety as a parallel transmission system the FSDTS must adhere
to the following requirements and constraints:

•  Safety  must  be  guaranteed  from  when  the  data  leaves  the  transmitting  safe 
system  until  it  arrives  at  the  receiving  safe  system. 

•  Due  to  the  nature  of  the  application,  safe  systems  which  are  communicating 
with  each  other  must  ensure  that  the  latest  information  is  always  transmitted  to 
the  remote  safe  systems. 

• If there is a break in data received by a FSDT, its outputs must be set to a safe
state.

•  The  integrity  of  the  channel  must  be  monitored  to  ensure  that  the  acceptable 
noise  level  is  not  exceeded.  If  the  level  is  exceeded  the  outputs  of  the  receiving 
FSDT  must  be  set  to  a  safe  state. 

• Once a FSDT has shut down, no safe data can be transmitted in the channel to
which it is connected.

• All first faults which affect the safety of the system (including those of the fault
detection circuitry) must be detected within approximately one second of
occurrence. Faults which do not affect the safety or those which cause the
system to revert to its safe state are also required to be detected within a
reasonable time to fulfil the requirements of maintainability.

• Having detected the first fault (affecting safety), the system must automatically
revert to a safe (predetermined) state.

•  The  safe  state  to  which  the  system  reverts  after  the  detection  of  the  first  fault 
must  be  irreversible  i.e.  the  system  must  not  be  able  to  become  unsafe  even 
after  the  occurrence  of  further  faults. 
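
Several of these requirements describe the behaviour of the receiving side and can be illustrated with a toy Python model. Everything in it is an assumption made for illustration: the class and method names, the use of a consecutive corrupted-frame count as the measure of the noise level, the safe output pattern of zero, and the choice that outputs recover after a mere break in data while a detected fault latches permanently.

```python
class FsdtReceiver:
    """Toy model of a receiving FSDT; names and thresholds are hypothetical."""

    SAFE_OUTPUT = 0  # assumed: all outputs de-energised

    def __init__(self, data_timeout_s: float, max_bad_frames: int):
        self.data_timeout_s = data_timeout_s
        self.max_bad_frames = max_bad_frames
        self.last_good_time = None
        self.bad_frames = 0
        self.output = self.SAFE_OUTPUT
        self.shut_down = False

    def on_frame(self, now: float, data: int, valid: bool) -> None:
        if self.shut_down:
            return  # a shut-down transceiver passes no safe data
        if not valid:
            self.bad_frames += 1
            if self.bad_frames > self.max_bad_frames:
                self.output = self.SAFE_OUTPUT  # noise level exceeded
            return
        self.bad_frames = 0
        self.last_good_time = now
        self.output = data  # always the latest received data

    def poll(self, now: float) -> None:
        """Call periodically; detects a break in the received data."""
        if (self.last_good_time is None
                or now - self.last_good_time > self.data_timeout_s):
            self.output = self.SAFE_OUTPUT

    def report_fault(self) -> None:
        """A detected first fault latches the safe state irreversibly."""
        self.shut_down = True
        self.output = self.SAFE_OUTPUT
```

Time is passed in explicitly rather than read from a clock, which keeps the model deterministic and easy to exercise.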

2 Safe Serial Data Transmission

Noise which is inherent in the transmission system and noise from the environment
will  corrupt  data  transmitted  over  the  communication  channel.  To  achieve  safe  data 
transmission  it  is  necessary  to  combat  the  effects  of  this  noise.  In  this  section  we 
will  demonstrate  that  it  is  possible  to  achieve  safe  data  transmission  over  a  serial 
channel  in  the  presence  of  noise,  by  incorporating  an  error  control  strategy.  For  this 
purpose  we  will  consider  only  one  link  of  the  FSDTS. 

2.1  Selecting  an  Error  Control  Strategy 

Two  strategies  which  are  used  to  control  errors  in  a  communication  channel  are 
automatic  repeat-request  (ARQ)  and  forward-error  correction  (FEC).  In  this  type 
of  application  an  ARQ  strategy  is  preferred  if  a  reverse  channel  is  available  [1,2]. 
To  implement  effective  coding  techniques  it  is  essential  to  obtain  statistical  data 
relating  to  the  types  of  errors,  and  also  their  occurrence  rate  [3].  Once  an  analysis 
of  the  errors  occurring  in  the  communication  channel  is  complete,  it  is  possible  to 
model the channel and select a coding scheme that will be the most effective against
the  predicted  noise. 

2.2  Expected  Errors  in  Channel 

Errors resulting from noise are classified into two main categories, i.e. random
errors and burst errors. Random errors result from random noise, which is
primarily Gaussian noise and shot noise. These errors are normally inherent in the
communication system. On the other hand, burst errors occur due to natural causes
(e.g. lightning) and man-made sources (e.g. power systems and electrical machinery)
[1,4]. As indicated by [5,6] both random and burst errors are expected to occur.
There  is  strong  evidence  to  suggest  that  long  transmission  periods  will  exist  where 
few  errors  will  occur  (mainly  random  errors),  followed  by  short  periods  where  a 
great  number  of  mainly  burst  errors  will  occur.  This  might  be  due  to  loss  of 
synchronisation  or  from  some  form  of  burst  noise. 

2.3 Channel Modelling

There are two approaches to modelling the channel. Collected statistical data can be
used to determine the parameters of the channel model ("descriptive" modelling) or
a mathematical model can be adapted to real measurements of the channel
("generative" modelling) [3]. It is apparent that in both instances substantial
statistical data is required to model the channel effectively.

Measured statistics of various channels are dependent on the type of equipment used
and the environment in which the channels operate. With the lack of available
statistical data and the vast number of different channel equipment types and
environments found in the proximity of the trackside, it was impractical to gather
sufficient statistical data to model all the channels with their associated equipment
and environment. It was thus decided to make the following assumptions:

* The channel has no memory, i.e. one bit inverted will not affect the next bit.

* The channel is a binary symmetric channel (BSC).

* Errors occur randomly and each bit has the same probability of being incorrectly
asserted.

* Messages are equiprobable, i.e. each message has the same probability of
containing as many "1's" as "0's".
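
These assumptions describe the textbook memoryless binary symmetric channel, which can be simulated in a few lines of Python. The function name `bsc`, the seed and the bit error rate of 0.01 below are illustrative choices, not values taken from the project.

```python
import random


def bsc(bits, p, rng):
    """Pass bits through a memoryless binary symmetric channel.

    Each bit is inverted independently with probability p, whatever its
    value and whatever happened to the previous bits.
    """
    return [bit ^ (rng.random() < p) for bit in bits]


rng = random.Random(1993)  # fixed seed so that the run is repeatable
sent = [1, 0, 1, 1, 0, 0, 1, 0] * 1000
received = bsc(sent, 0.01, rng)
errors = sum(s != r for s, r in zip(sent, received))
```

Because every bit is treated independently, such a model produces no bursts, which is why results derived from it are optimistic for a real trackside channel.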

By ignoring the effects of burst noise the results obtained for the probability of error
would be optimistic, as illustrated by [7]. For this reason a random error
detecting/correcting code with good random and burst error capabilities will be used to
implement the error coding in the error control system. Thereafter techniques to
improve the code’s burst error capabilities will be implemented.

2.4  Selecting  an  Error  Control  Code 

Error control codes are divided into two main types: block codes and
convolutional codes. In this application convolutional codes were not considered for
reasons mentioned in [1,8]. A block code or cyclic code with good random and
burst error detection capabilities will be selected and thereafter the code’s burst error
capabilities will be improved. The code evaluation will be based on the probability
of an undetected error P_u(E).

2.4.1 Error Capabilities of Linear Block Codes

The error control capability of a linear block code used for error detection is
determined by the code’s minimum distance (d_min), known as the Hamming distance
[9,10,11]. For an (n,k) linear block code, there are (2^k - 1) possible undetected errors.
These occur when a code-vector is corrupted in such a way that it becomes another
valid code-vector. If the weight distribution (Hamming weights) of the code is known,
it is possible to calculate the probability of an undetected error for the code, if it is
used on a BSC. This can be calculated using equation (1), as described in [11, 12,
13], as follows:

P_u(E) = \sum_{i=1}^{n} A_i p^i (1-p)^{n-i}    (1)

where A_i is the number of code vectors of weight i in the code and p is the channel
bit error rate (BER). For large values of n and k, however, it becomes almost
impossible to calculate the weight distribution of a code. In some instances, however,
it is possible to calculate the weight distribution of the dual of the code. By using
the MacWilliams identity [14], it is possible to calculate the probability of an undetected
error for the code. If this method is not possible, then equation (2) can be used.

P_u(E) = 2^{-(n-k)} B(1-2p) - (1-p)^n,  where  B(1-2p) = \sum_{i=0}^{n} B_i (1-2p)^i    (2)

where {B_i} is the weight distribution of the dual code. If neither of the
above methods is possible, the upper bound given in equation (3) can be used.

P_u(E) \le 2^{-(n-k)}    (3)

Here 2^{-(n-k)} is an upper bound for the average probability of an undetected error.
It must be emphasized that this upper bound is only valid for a few codes, as
illustrated in [15,16,17].
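
The three expressions can be compared numerically on a small code. The Python sketch below uses the (7,4) Hamming code and its dual, the (7,3) simplex code, purely as an assumed example because their weight distributions are well known; they are not the codes used in the FSDTS. Equation (2) is taken in its standard dual-code form, P_u(E) = 2^{-(n-k)} B(1-2p) - (1-p)^n.

```python
from math import isclose

# Assumed example: the (7,4) Hamming code and its dual, the (7,3) simplex code.
N, K = 7, 4
A = [1, 0, 0, 7, 7, 0, 0, 1]  # A_i: number of code vectors of weight i
B = [1, 0, 0, 0, 7, 0, 0, 0]  # B_i: number of dual code vectors of weight i


def p_undetected_direct(p: float) -> float:
    """Equation (1): sum over the non-zero weights of the code itself."""
    return sum(A[i] * p**i * (1 - p)**(N - i) for i in range(1, N + 1))


def p_undetected_dual(p: float) -> float:
    """Equation (2): the same quantity from the dual weight distribution."""
    b = sum(B[i] * (1 - 2 * p)**i for i in range(N + 1))
    return 2.0**-(N - K) * b - (1 - p)**N


def upper_bound() -> float:
    """Equation (3)."""
    return 2.0**-(N - K)


for ber in (0.001, 0.01, 0.1, 0.5):
    direct = p_undetected_direct(ber)
    dual = p_undetected_dual(ber)
    print(f"p={ber}: (1) {direct:.3e}  (2) {dual:.3e}  (3) {upper_bound():.3e}")
```

For this code the two formulas agree to within rounding error and stay below the bound at every bit error rate tried. Equation (2) subtracts two nearly equal numbers when p is small, so equation (1) is numerically preferable whenever the weight distribution of the code itself is available.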

2.4.2  Code  Evaluation 

In  order  to  satisfy  the  system  safety  requirement  the  code  selected  must  improve  the 
channel  BER  from  10'^  to  1.32  x  10'^*  (with  a  transmission  rate  2400  bps).  By  using 
the  MacWilliams  identity  and  tables  in  [18,19],  it  was  found  that  certain  Bose, 
Chaudhuri  and  Hocquenghem  (BCH)  codes,  the  most  powerful  known  class  of  binary 
codes for correcting random errors [20], would satisfy the safety requirement.

2.4.3 Improving a Code’s Burst Noise Immunity

As a result of the assumptions made, the theoretical values for P_u(E) will be
optimistic. By using techniques such as bit stuffing, interleaving, concatenated codes
and a second code it is possible to improve the code’s burst error capabilities. Of
these methods interleaving offers the best improvement for the overhead incurred, as
described in [21,22].

2.5 Error Control Implementation

The code selected in this application is an interleaved BCH(15,7,2) code with an
interleaving degree λ of 5. In this application a modified ARQ communication scheme
is used. Retransmissions are not required because the latest data is always
transmitted. Received data that is corrupted is discarded. Encoding and decoding of the
data is done by a lookup table. There are 2^7 valid code-vectors requiring 128 bytes
of  memory.  The  time  taken  to  lookup  one  code-vector  is  ±  Aps  excluding  the 
software  overhead. 
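
A table-driven encoder with interleaving can be sketched in Python as follows. The sketch assumes the usual generator polynomial of the double-error-correcting (15,7) BCH code, g(x) = x^8 + x^7 + x^6 + x^4 + 1, systematic encoding, and a simple row/column interleaver; the paper does not state these details, and all function names are hypothetical. Corrupted code-vectors are discarded rather than corrected, as in the scheme described above.

```python
N, K, DEPTH = 15, 7, 5
G_POLY = 0b111010001  # assumed generator: x^8 + x^7 + x^6 + x^4 + 1


def encode(msg: int) -> int:
    """Systematic encoding: 7 message bits followed by 8 parity bits."""
    rem = msg << (N - K)
    for shift in range(K - 1, -1, -1):
        if rem & (1 << (shift + N - K)):
            rem ^= G_POLY << shift  # polynomial division modulo 2
    return (msg << (N - K)) | rem


ENCODE_TABLE = [encode(m) for m in range(1 << K)]            # 128 entries
DECODE_TABLE = {cw: m for m, cw in enumerate(ENCODE_TABLE)}  # valid words only


def interleave(codewords):
    """Send bit 14 of every code-vector, then bit 13 of every one, and so on."""
    return [(cw >> col) & 1 for col in range(N - 1, -1, -1) for cw in codewords]


def deinterleave(bits, depth=DEPTH):
    rows = [0] * depth
    for idx, bit in enumerate(bits):
        rows[idx % depth] = (rows[idx % depth] << 1) | bit
    return rows


def receive(bits):
    """Return the decoded messages, with None for each corrupted code-vector."""
    return [DECODE_TABLE.get(row) for row in deinterleave(bits)]


messages = [1, 42, 127, 0, 99]
stream = interleave([ENCODE_TABLE[m] for m in messages])
clean = receive(stream)

burst = stream[:]
for i in range(20, 25):  # a burst of five consecutive bit errors
    burst[i] ^= 1
damaged = receive(burst)
```

The five-bit burst lands as a single error in each of the five code-vectors, so every one of them fails the table lookup and is discarded. Under the assumed minimum distance of 5, any burst of up to 20 bits leaves at most four errors per code-vector and is therefore detected.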

3  Development  of  Fail-Safe  Data  Transceiver 

When microprocessors are used in the design of equipment that is to be used in
life-critical applications, it is not feasible to incorporate interlocking to detect all the
possible microprocessor failure modes. To overcome this problem, the system is
designed to have either a voting or comparative architecture. In both approaches
similar or diverse hardware can be used. In this section we present the formulas to
quantify the system safety, and discuss the FSDT.

3.1  Determining  the  Hardware  Safety 

Hardware safety is expressed as the mean time between wrong side failures
(MTBWSF). The MTBWSF is derived from the mean time between failures
(MTBF) of the components of the system. We will consider the two-out-of-two
comparative architecture. The following assumptions are made:

• components are used in their useful life period (i.e. random failures occur).

•  all  first  faults  will  result  in  an  unsafe  failure  if  undetected. 

The  instantaneous  failure  rate  or  hazard  rate  of  a  component  is  expressed  as 

\lambda(t) = \frac{f(t)}{R(t)}    (4)

where f(t) is the failure probability density function and R(t) is the reliability of the
component. When components are used during their useful life, λ(t) is constant and is
expressed as λ. The mean time between system failures (MTBSF), as calculated in
[23], is expressed in equation (5), where λ ≪ μ and 1/μ is the mean repair time.

MTBSF  »  JL  (5) 


Consider  a  two-processor  system  where  the  system  output  is  controlled  by  a 
redundant  management  system  which  has  the  ability  to  shut  the  system  down  to  a 
safe state should faults occur in either of the processors. The following assumptions
are made:

•  There  are  no  design  errors  in  the  system  which  will  render  the  redundant 
management  system  unable  to  detect  an  unsafe  failure. 

• Failures in each of the two processors are independent.

• Any failure occurring after the first undetected failure will result in the
management system being unable to revert the system to a safe state.

The  system  safety  can  be  quantified  in  terms  of  the  MTBWSF  by  redefining  f,i(t)  as 
the  probability  density  function  of  the  fault  detection  process  which  has  a  fault 
detection  time  of  x.  The  safety  can  now  be  expressed  as 

MTBWSF  •  i  +  -L.  -  (6) 

X  2Xh  2x 

When  the  two  elements  are  configured  as  described  above  and  used  in  a  safe 
environment  the  two  redundant  elements  together  operate  as  a  single  element  and  the 


It  is  thus  possible  to  determine  the  safety  of  the  system  by  calculating  the  MTBF  of 
each  system  in  a  two-out-of-two  comparative  architecture.  The  MTBF  of  each 
system is derived from the MTBF of the individual components that make up each
system. 
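
A minimal sketch of that last step follows. It assumes that a sub-system fails when any one of its components fails (a series model) and that each component has a constant failure rate, so the rates add. The component names and MTBF values are hypothetical and are not figures from the FSDT design.

```python
from typing import Iterable


def system_mtbf(component_mtbfs_hours: Iterable[float]) -> float:
    """MTBF of a series system: the reciprocal of the summed failure rates."""
    total_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_rate


# Hypothetical component MTBFs (hours) for one sub-system.
components = {
    "input": 200_000.0,
    "output": 200_000.0,
    "comms": 100_000.0,
    "display": 400_000.0,
}

sub_system_mtbf = system_mtbf(components.values())
sub_system_rate = 1.0 / sub_system_mtbf  # the lambda used in the safety formulas

print(f"sub-system MTBF: {sub_system_mtbf:.0f} h")
print(f"sub-system failure rate: {sub_system_rate:.3e} per hour")
```

The resulting failure rate is the λ that would be substituted into the MTBSF and MTBWSF expressions.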

3.2 Selecting the FSDT Architecture

After fault-tree and FMECA analyses were performed, it was found that a
two-out-of-two comparative architecture would meet the FSDT requirements. Similar,
as opposed to diverse, hardware was selected. An illustration of the architecture is
given in Figure 3.

3.3 Description of the FSDT Hardware Implementation

The FSDT comprises two identical sub-systems, which are electrically isolated
from each other to obtain statistically independent errors. The modules within each
sub-system are microprocessor controlled and are arranged in a master/slave
configuration. Each sub-system comprises an input, output, comms and display
module as illustrated in Figure 4. Each module can be configured as a master or a
slave.

The slave performs only its primary function, whereas a master performs its primary
function, data collection and distribution, and the rendezvous with the other
sub-system for data. The two sub-systems are loosely synchronised and re-adjustment
occurs at each rendezvous.

Figure 3: Fail-Safe Data Transceiver Architecture

All of the modules in both sub-systems have the ability to shut the system down on
the detection of an error. This is done by blowing a fuse which isolates the FSDT
from the safe system. Once the FSDT is shut down, no data can pass between itself
and the safe system. The serial channels can still be used to transmit maintenance
information, depending on the type of failure that caused the system to be shut down.
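
The shutdown behaviour described above can be sketched in software terms. The example below is a toy model only: the class, the channel functions and the byte frames are hypothetical, and the real FSDT implements the comparison and fuse in hardware and assembler. It shows two channels processing the same frame, a comparison at the rendezvous, and a latched shutdown after which no data passes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TwoOutOfTwo:
    """Toy two-out-of-two comparator with a latched (fuse-like) shutdown."""
    channel_a: Callable[[bytes], bytes]
    channel_b: Callable[[bytes], bytes]
    fuse_blown: bool = False
    log: List[str] = field(default_factory=list)

    def transfer(self, frame: bytes) -> Optional[bytes]:
        """Return the agreed output, or None once the system is shut down."""
        if self.fuse_blown:
            return None  # isolated from the safe system: nothing passes
        out_a = self.channel_a(frame)
        out_b = self.channel_b(frame)
        if out_a != out_b:
            # Disagreement at the rendezvous: shut down permanently.
            self.fuse_blown = True
            self.log.append("mismatch at rendezvous: fuse blown")
            return None
        return out_a


def healthy(frame: bytes) -> bytes:
    """A channel that passes the frame through unchanged."""
    return frame


def faulty(frame: bytes) -> bytes:
    """A channel with a hypothetical fault: flips one bit of the last byte."""
    if not frame:
        return frame
    return frame[:-1] + bytes([frame[-1] ^ 0x01])


good = TwoOutOfTwo(healthy, healthy)
passed = good.transfer(b"\x10\x20")

bad = TwoOutOfTwo(healthy, faulty)
first = bad.transfer(b"\x10\x20")   # channels disagree, fuse blows
second = bad.transfer(b"\x10\x20")  # still blocked: the shutdown is latched
```

The latch is the design point: once a disagreement has been seen, later agreement between the channels does not restore data transfer, mirroring a blown fuse.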


Figure  4:  FSDT  Sub-System  Components 


4  Ensuring  "Error-Free"  Software 

The  FSDTS  that  is  under  evaluation  at  present  has  software  which  is  functionally 
correct.  By  this  is  meant  that  the  functions  performed  by  the  software  are  correct, 
but the software has not been verified and has not been proven to be correct. The
software was written in assembler and no interrupts have been used. The validation
and  verification  of  the  software  is  part  of  the  research  that  is  currently  being 
undertaken  at  the  University  of  Cape  Town.  The  following  issues  are  being 
addressed: 

•  Use  of  fault-tree  analysis  and  other  methods  to  identify  high  risk  areas  and 
to  provide  input  into  the  validation  process. 

•  Generation  of  a  meta  language  to  facilitate  the  formulation  of  the  system 
requirements into a formal specification.

•  Using  Statecharts  to  model  and  represent  the  system  for  input  into  a 
synchronous  language  such  as  Esterel. 

5  Conclusions 

To  date  5  systems  have  been  manufactured  and  are  currently  being  tested  in  Cape 
Town.  Three  of  the  links  used  are  leased  telephone  type  circuits  and  the  fourth  is 
a microwave link. The telephone links are 4-15 km in length and the microwave link
is ±100 km in length. At present the FSDTs are monitored by Test Generation
Modules (TGMs) and safety system simulators (SSS). To date the FSDTS has
operated successfully and once the software research is completed the FSDTs will
be  used  in  live  applications.  At  this  stage  it  is  anticipated  that  the  systems  will  be 
ready  for  installation  early  in  1994. 

The world-wide market for safe, secure and reliable computer
systems  is  rapidly  expanding:  safety  is  one  of  the  top  priorities 
for  many  high  technology  applications.  Among  the  industrial 
and  business  sectors  which  are  especially  concerned  with 
safety  are:  aerospace,  manufacturing  and  machinery  control, 
water  treatment,  mining,  rail,  military,  medical,  power,  shipping, 
insurance,  certification,  and  standards  making. 

SAFECOMP  ’93  is  an  opportunity  for  technical  developers, 
users  and  legislators  to  exchange  and  review  their  experiences, 
to  consider  the  best  technologies  now  available,  and  to  identify 
the  skills  and  technologies  required  for  the  future.  It  focuses  on 
critical  computer  applications,  presenting  current  research  and 
new  trends  in  computer  safety,  reliability  and  security,  and 
providing  a  platform  for  technology  transfer  between  academia, 
industry  and  research  institutions.  It  is  outstanding  for  its 
international breadth (authors from 16 different countries), its
way  of  combining  participants  from  academia,  research  and 
industry,  and  its  wide  topical  coverage. 

This  book  presents  the  proceedings  of  SAFECOMP  ’93:  the 
12th International Conference on Computer Safety, Reliability
and Security, held in Poznan, Poland, 27-29 October 1993. The
papers  cover  a  broad  spectrum  of  subjects  including  formal 
methods  and  models,  safety  assessment  and  analysis, 
verification  and  validation,  testing,  reliability  issues  and 
dependable  software  technology,  computer  languages  for 
safety  related  systems,  reactive  systems  technology,  security 
and  safety  related  applications. 

SAFECOMP  ’93  is  for  all  those  in  universities,  research 
institutions,  industry  and  business  who  want  to  be  well- 
informed  about  the  current  international  state  of  the  art  in 
computer  safety,  reliability  and  security.  It  provides  a 
representative  sample  of  recent  research  results  and 
applications  problems,  presented  by  experts  from  industrial  and 
academic  institutions. 